Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-09 Thread Muhammad Sharfuddin

On 3/10/2018 10:00 AM, Andrei Borzenkov wrote:

On 09.03.2018 19:55, Muhammad Sharfuddin wrote:

Hi,

This two-node cluster starts resources when both nodes are online, but
does not start the ocfs2 resources when one node is offline. E.g., if I
gracefully stop the cluster resources, then stop the pacemaker service
on either node, and try to start the ocfs2 resource on the remaining
online node, it fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
(ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out

That sounds like the problem. It attempts to fence the other node, but
you do not have any fencing resources configured, so it cannot work. You
need to ensure you have a working fencing agent in your configuration.
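
For reference, a quick way to confirm that a fencing device is actually
registered and able to fence a given node (assuming the standard Pacemaker
command-line tools; the sbd device path is a placeholder) is something like:

# list fencing devices registered with the cluster
stonith_admin --list-registered

# list devices able to fence a particular node
stonith_admin --list pipci001

# for sbd, check the node slots on the shared device
sbd -d /dev/<sbd-device> list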
sbd is configured and working in this cluster, and after multiple failed
attempts to start the ocfs2 resource, this standalone online node gets
fenced too.

logs:
pengine[17732]:  warning: Scheduling Node pipci001 for STONITH
pengine[17732]:   notice: Stop of failed resource dlm:0 is implicit 
after pipci001 is fenced

pengine[17732]:   notice:  * Fence pipci001
pengine[17732]:   notice: Stop    sbd-stonith#011(pipci001)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:  warning: Calculated transition 6 (with warnings), 
saving inputs in /var/lib/pacemaker/pengine/pe-warn-15.bz2
2018-03-09T21:03:30.588865+05:00 pipci002 crmd[13030]:   notice: 
Processing graph 6 (ref=pe_calc-dc-1520611410-34) derived from 
/var/lib/pacemaker/pengine/pe-warn-15.bz2

crmd[17733]:   notice: Requesting fencing (reboot) of node pipci001
stonith-ng[13026]:   notice: Client crmd.13030.f5570444 wants to fence 
(reboot) 'pipci001' with device '(any)'

stonith-ng[13026]:   notice: Requesting peer fencing (reboot) of pipci001
stonith-ng[13026]:   notice: sbd-stonith can fence (rebo

Also, as mentioned, this cluster starts resources when both nodes are
online, and stonith is enabled and working.

cluster properties:
property cib-bootstrap-options: \
    have-watchdog=true \
    stonith-enabled=true \
    stonith-timeout=80 \
    startup-fencing=true \

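For context, dlm_controld takes its quorum information from corosync
directly, independent of Pacemaker's no-quorum-policy, so on a two-node
cluster the corosync quorum settings are worth double-checking as well.
A typical two-node setup uses something along these lines (a sketch only,
not taken from this cluster; the sbd agent name varies by distribution):

# /etc/corosync/corosync.conf (excerpt)
quorum {
    provider: corosync_votequorum
    two_node: 1    # implies wait_for_all: 1
}

# a typical sbd fencing resource, in crm shell syntax
primitive sbd-stonith stonith:external/sbd \
    op monitor interval=600 timeout=15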


kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:    error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)
pengine[17732]:  warning: Forcing 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-09 Thread Andrei Borzenkov
On 09.03.2018 19:55, Muhammad Sharfuddin wrote:
> Hi,
> 
> This two-node cluster starts resources when both nodes are online, but
> does not start the ocfs2 resources when one node is offline. E.g., if I
> gracefully stop the cluster resources, then stop the pacemaker service
> on either node, and try to start the ocfs2 resource on the remaining
> online node, it fails.
> 
> logs:
> 
> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]:   notice: Calculated transition 2, saving
> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]:   notice: Processing graph 2
> (ref=pe_calc-dc-1520613202-31) derived from
> /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
> pipci001
> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
> (ok)
> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
> locally on pipci001
> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
> locally on pipci001
> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
> /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
> the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out

That sounds like the problem. It attempts to fence the other node, but
you do not have any fencing resources configured, so it cannot work. You
need to ensure you have a working fencing agent in your configuration.

> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
> event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
> failed -512 0
> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]:    error: Result of start operation for p-fssapmnt on
> pipci001: Timed Out
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition aborted by operation
> p-fssapmnt_start_0 'modify' on pipci001: Event failed
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
> Skipped=0, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 3, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-340.bz2
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
> after 100 failures (max=2)
> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 4, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0
> 

Re: [ClusterLabs] [Problem]The pengine core dumps when changing attributes of bundle.

2018-03-09 Thread Ken Gaillot
On Sat, 2018-03-10 at 05:47 +0900, renayama19661...@ybb.ne.jp wrote:
> Hi All, 
> 
> [Sorry, there was a defect in the line breaks. Sending again.]
> 
> I was checking the operation of Bundle with Pacemaker version 2.0.0-
> 9cd0f6cb86. 
> When a Bundle resource is configured in Pacemaker and its attributes
> are changed, the pengine core dumps.

Hi Hideo,

At first glance, it's confusing. The backtrace shows that
find_container_child() is being called with a NULL rsc, but I don't see
how it's possible to call it that way.
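
For anyone wanting to reproduce the backtrace locally, something along these
lines usually works, assuming gdb and the matching debuginfo are installed
(the PID is taken from the log below; paths vary by distribution):

# Pacemaker normally keeps daemon core dumps here
ls /var/lib/pacemaker/cores/

# open the core with the matching binary and print the backtrace
gdb /usr/libexec/pacemaker/pengine /var/lib/pacemaker/cores/core.17726
(gdb) bt full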

We'll investigate further and get back to you on the BZ.

> 
> Step1) Start Pacemaker and load the settings. (The replicas and
> replicas-per-host are set to 1.)
> 
> [root@rh74-test ~]# cibadmin --modify --allow-create --scope
> resources -X '
>   replicas-per-host="1" options="--log-driver=journald" />  range-start="192.168.20.188" host-interface="ens192" host-
> netmask="24">  
>   root="/var/local/containers" target-dir="/var/www/html"
> options="rw"/>  root="/var/log/pacemaker/bundles" target-dir="/etc/httpd/logs"
> options="rw"/>   provider="heartbeat" type="apache" >id="rabbitmq-start-interval-0s" interval="0s" name="start"
> timeout="200s"/>  name="stop" timeout="200s" on-fail="fence" /> 
> 
> ' 
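
The XML payload above was stripped by the list archive; based on the attribute
fragments that survived and standard Pacemaker bundle syntax, the original
configuration presumably looked roughly like this (element names and ids are
reconstructed, not original):

<bundle id="httpd-bundle">
  <docker image="pcmktest:http" replicas="1" replicas-per-host="1"
          options="--log-driver=journald"/>
  <network ip-range-start="192.168.20.188" host-interface="ens192"
           host-netmask="24"/>
  <storage>
    <storage-mapping id="httpd-root" source-dir-root="/var/local/containers"
                     target-dir="/var/www/html" options="rw"/>
    <storage-mapping id="httpd-logs" source-dir-root="/var/log/pacemaker/bundles"
                     target-dir="/etc/httpd/logs" options="rw"/>
  </storage>
  <primitive id="httpd" class="ocf" provider="heartbeat" type="apache">
    <operations>
      <op id="rabbitmq-start-interval-0s" interval="0s" name="start" timeout="200s"/>
      <op id="rabbitmq-stop-interval-0s" interval="0s" name="stop" timeout="200s"
          on-fail="fence"/>
    </operations>
  </primitive>
</bundle>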
> 
> Step2) Bundle is configured. 
> 
> [root@rh74-test ~]# crm_mon -1 -Af
> Stack: corosync
> Current DC: rh74-test (version 2.0.0-9cd0f6cb86) - partition WITHOUT
> quorum
> Last updated: Fri Mar  9 10:09:20 2018
> Last change: Fri Mar  9 10:06:30 2018 by root via cibadmin on rh74-test
> 
> 2 nodes configured
> 4 resources configured
> 
> Online: [ rh74-test ]
> GuestOnline: [ httpd-bundle-0@rh74-test ]
> 
> Active resources:
> 
>  Docker container: httpd-bundle [pcmktest:http]
>    httpd-bundle-0 (192.168.20.188) (ocf::heartbeat:apache): Started rh74-test
> 
> Node Attributes:
> * Node httpd-bundle-0@rh74-test:
> * Node rh74-test:
> 
> Migration Summary:
> * Node rh74-test:
> * Node httpd-bundle-0@rh74-test:
> 
> Step3) Change attributes of bundle with cibadmin command. (The
> replicas and replicas-per-host change to 3.)
> 
> 
> [root@rh74-test ~]# cibadmin --modify -X ' image="pcmktest:http" replicas="3" replicas-per-host="3" options="
> --log-driver=journald"/>' 
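
Here the archive stripped only the opening tag; the modify command was
presumably along the lines of:

[root@rh74-test ~]# cibadmin --modify -X '<docker image="pcmktest:http" replicas="3" replicas-per-host="3" options="--log-driver=journald"/>'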
> 
> Step4) The pengine will core dump. (snip)
> Mar  9 10:10:21 rh74-test pengine[17726]:  notice: On loss of quorum:
> Ignore
> Mar  9 10:10:21 rh74-test pengine[17726]:    info: Node rh74-test is
> online
> Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to pengine
> failed
> Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to
> pengine[0x55f2d068bfb0] closed (I/O condition=25)
> Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: Managed process
> 17726 (pengine) dumped core
> Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: pengine[17726]
> terminated with signal 11 (core=1)
> Mar  9 10:10:21 rh74-test pacemakerd[17719]:  notice: Respawning
> failed child process: pengine
> Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Using uid=990
> and group=984 for process pengine
> Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Forked child
> 19275 for process pengine
> (snip) 
> 
> This event reproduces 100 percent. 
> 
> Apparently the problem seems to be due to different handling of
> clone(httpd) resources in the Bundle resource. 
> 
> - I registered this content with the following Bugzilla.
> (https://bugs.clusterlabs.org/show_bug.cgi?id=5337)
> 
> Best Regards
> Hideo Yamauchi.
-- 
Ken Gaillot 


Re: [ClusterLabs] page not found

2018-03-09 Thread Kevin Martin

or these links:

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch06.html

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_specifying_a_preferred_location.html

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_ensuring_resources_run_on_the_same_host.html

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_controlling_resource_start_stop_ordering.html

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html


Cluster down?

Regards,

Kevin Martin


On 3/9/2018 11:30 AM, Kevin Martin wrote:


http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch09.html 
from the link on page http://clusterlabs.org/quickstart-redhat.html.




[ClusterLabs] page not found

2018-03-09 Thread Kevin Martin
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch09.html 
from the link on page http://clusterlabs.org/quickstart-redhat.html.


--
Regards,
Kevin Martin


Re: [ClusterLabs] [Problem]The pengine core dumps when changing attributes of bundle.

2018-03-09 Thread renayama19661014
Hi All, 

[Sorry, there was a defect in the line breaks. Sending again.]

I was checking the operation of the Bundle feature with Pacemaker version
2.0.0-9cd0f6cb86. When a Bundle resource is configured in Pacemaker and its
attributes are changed, the pengine core dumps.

Step1) Start Pacemaker and load the settings. (The replicas and
replicas-per-host are set to 1.)

[root@rh74-test ~]# cibadmin --modify --allow-create --scope resources -X '
   
 
 
  
' 

Step2) Bundle is configured. 

[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-9cd0f6cb86) - partition WITHOUT quorum
Last updated: Fri Mar  9 10:09:20 2018
Last change: Fri Mar  9 10:06:30 2018 by root via cibadmin on rh74-test

2 nodes configured
4 resources configured

Online: [ rh74-test ]
GuestOnline: [ httpd-bundle-0@rh74-test ]

Active resources:

 Docker container: httpd-bundle [pcmktest:http]
   httpd-bundle-0 (192.168.20.188) (ocf::heartbeat:apache): Started rh74-test

Node Attributes:
* Node httpd-bundle-0@rh74-test:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
* Node httpd-bundle-0@rh74-test:

Step3) Change attributes of bundle with cibadmin command. (The replicas and 
replicas-per-host change to 3.)


[root@rh74-test ~]# cibadmin --modify -X '' 

Step4) The pengine will core dump. (snip)
Mar  9 10:10:21 rh74-test pengine[17726]:  notice: On loss of quorum: Ignore
Mar  9 10:10:21 rh74-test pengine[17726]:    info: Node rh74-test is online
Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to pengine failed
Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to 
pengine[0x55f2d068bfb0] closed (I/O condition=25)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: Managed process 17726 
(pengine) dumped core
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: pengine[17726] terminated 
with signal 11 (core=1)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  notice: Respawning failed child 
process: pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Using uid=990 and 
group=984 for process pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Forked child 19275 for 
process pengine
(snip) 

This event reproduces 100 percent of the time.

Apparently the problem is due to the different handling of the clone (httpd)
resources inside the Bundle resource.

- I registered this content with the following Bugzilla.
(https://bugs.clusterlabs.org/show_bug.cgi?id=5337)

Best Regards
Hideo Yamauchi.




[ClusterLabs] single node fails to start the ocfs2 resource

2018-03-09 Thread Muhammad Sharfuddin

Hi,

This two-node cluster starts resources when both nodes are online, but
does not start the ocfs2 resources when one node is offline. E.g., if I
gracefully stop the cluster resources, then stop the pacemaker service
on either node, and try to start the ocfs2 resource on the remaining
online node, it fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving 
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2 
(ref=pe_calc-dc-1520613202-31) derived from 
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on 
pipci001

lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69 
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms

crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6 
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0 
locally on pipci001

lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for 
/dev/mapper/sapmnt on /sapmnt

kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining 
the lockspace group.

dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group 
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join 
failed -512 0

lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71 
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms

kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:    error: Result of start operation for p-fssapmnt on 
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed 
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation 
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed 
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0, 
Skipped=0, Incomplete=6, 
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is 
required

pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
100 failures (max=2)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
100 failures (max=2)

pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 3, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-340.bz2
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is 
required

pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
100 failures (max=2)
pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001 
after 100 failures (max=2)

pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 4, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36) 
derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0 
locally on pipci001

lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:stop call_id:72
Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt 
on /sapmnt
pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:stop 
call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
pipci001 crmd[17733]:   notice: Result of stop operation for p-fssapmnt 
on pipci001: 

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-09 Thread Jan Friesse

Thomas,


Hi,

On 3/7/18 1:41 PM, Jan Friesse wrote:

Thomas,


First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:


...


TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg and with our cpg usage in our filesystem, this seems
validly tested, not just a single occurrence.


I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when a node with a higher nodeid is paused.
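
For anyone who wants to try it, one way to "pause" a node for this kind of
test is to SIGSTOP the corosync process on the node with the lowest nodeid
for longer than the token timeout (pausing the VM works as well); the timing
below is only an illustration:

# on every node: join the test CPG group and watch the callbacks printed
testcpg

# on the node with the lowest nodeid: pause corosync past the token
# timeout, then resume it
kill -STOP $(pidof corosync)
sleep 10
kill -CONT $(pidof corosync)

# buggy result as described above: a TotemConfchgCallback with a new ringid
# arrives, but no ConfchgCallback listing the nodes that left and rejoined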


Most clusters use power fencing, so they simply never see this problem.
That may also be the reason why it wasn't reported a long time ago (this
bug has existed virtually at least since OpenAIS Whitetank). So really
nice work finding this bug.


What I'm not entirely sure about is the best way to solve this problem.
What I am sure of is that it's going to be "fun" :(


Let's start with a very high-level view of possible solutions:

- "Ignore the problem". CPG behaves more or less correctly: the "current"
membership really didn't change, so it doesn't make much sense to inform
about a change. It's possible to use cpg_totem_confchg_fn_t to find out
when the ringid changes. I'm adding this solution just for completeness,
because I don't prefer it at all.

- cpg_confchg_fn_t adds all nodes that left and rejoined into the left/join lists.

- cpg sends an extra cpg_confchg_fn_t call about the left and joined
nodes. I would prefer this solution simply because it makes the cpg behavior
equal in all situations.


Which of the options would you prefer? Same question also for @Ken (what
would you prefer for Pacemaker?) and @Chrissie.


Regards,
  Honza




cheers,
Thomas




Now it's really the cpg application's problem to synchronize its data. Many
applications (usually filesystems) use quorum together with fencing to find
out which cluster partition is quorate and to clean up the inquorate one.

Hopefully my explanation helps you, and feel free to ask more questions!



They help, but I'm still a bit unsure about why the callback could not
happen here; I may need to dive a bit deeper into corosync :)


Regards,
   Honza



help would be appreciated, much thanks!

cheers,
Thomas

[1]: 
https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: 
https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096



Re: [ClusterLabs] [Cluster-devel] DLM connection channel switch take too long time (> 5mins)

2018-03-09 Thread Digimer
On 2018-03-09 01:32 AM, Gang He wrote:
> Hello Digimer,
> 
> 
> 

>> On 2018-03-08 12:10 PM, David Teigland wrote:
 I use active rrp_mode in corosync.conf and reboot the cluster to let the 
>> configuration effective.
 But, the about 5 mins hang in new_lockspace() function is still here.
>>>
>>> The last time I tested connection failures with sctp was several years
>>> ago, but I recall seeing similar problems.  I had hoped that some of the
>>> sctp changes might have helped, but perhaps they didn't.
>>> Dave
>>
>> To add to this; we found serious issues with DLM over sctp/rrp. Our
>> solution was to remove RRP and rely on active/passive (mode=1) bonding.
>> I do not believe you can make anything using DLM reliable on RRP in
>> either active or passive mode.
> Do you have the detailed steps to describe this workaround?
> I mean, how do I remove RRP and rely on active/passive (mode=1) bonding?
> From the code, we have to use the sctp protocol in DLM on a two-rings cluster.
> 
> Thanks
> Gang

I'm using RHEL 6, so for me, disabling rrp was simply removing the rrp
attribute and the  child elements. As for bonding, here's how I
did it;

https://www.alteeve.com/w/AN!Cluster_Tutorial_2#Configuring_our_Bridge.2C_Bonds_and_Interfaces
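
For reference, an active/passive (mode=1) bond on RHEL 6 is typically defined
along these lines (interface names and addressing are placeholders):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=1 miimon=100"
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.20.0.1
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for the second NIC)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none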

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Re: Re: Re: How to configure to make each slave resource have one VIP

2018-03-09 Thread Jehan-Guillaume de Rorthais
On Fri, 9 Mar 2018 00:54:00 +
范国腾  wrote:

> Thanks Rorthais, got it. The following commands could make sure that it moves
> to the master if there is no standby alive:
> 
> pcs constraint colocation add pgsql-ip-stby1 with slave pgsql-ha 100
> pcs constraint colocation add pgsql-ip-stby1 with pgsql-ha 50

Exact.


Re: [ClusterLabs] copy file

2018-03-09 Thread Mevo Govo
Hi,
Thanks for the advice, I'm thinking about an optimal config for us. While the
DB is running, it would do native DB replication. But Oracle needs synchronized
controlfiles when it starts normally. I can save the file before overwriting
it. Currently I mean this (c1, c2, c3, c4, c5, c6 are control files):

c1: on node A, local file system
c2: on node A, on DRBD device1
c3: on node A, on DRBD device2 (FRA)
c4: on node B, on DRBD device2 (FRA)
c5: on node B, on DRBD device1
c6: on node B, local file system

c2+c3 is a "standard" oracle config. c2 is replicated into FRA (fast
recovery area of oracle). c1 (and c6) is just if all data in DRBD is lost.
c1, c2, c3, c4, c5 (but not c6) are in sync while DB runs on node A.
(c1,c2,c3: native DB replication, c2-c5, c3-c4 DRBD replication, protocol C)
When I switch from node A to node B, c6 is out of sync (older version). I
can (and I will) save it before overwrite by c5. But if c5 is corrupt,
manual repair is needed, and there are other replications to repair it (c4,
c3, c2, c1)
If c1 and c6 would be the same file on an nfs filesystem, there would be a
replication outside of DRBD without this "copy sync" problem. But in this
case, the fail of only one component (the nfs) would cause unavailable the
oracle DB on both node. (oracle DB stops if either of controlfile is lost
or corrupted. No automatic reapair happens)
As I think, the above consideration is similar to 3 node.
If we trust DRBD, no c1 and c6 would be needed, but we are new users of
DRBD.
Thanks: lados.
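
A minimal sketch of the kind of "copy before start" agent Ken suggests in the
quoted thread below; the parameter names (source/target) and the state-file
location are illustrative only, modelled loosely on ocf:pacemaker:Dummy:

#!/bin/sh
# copyfile: OCF-style agent that copies one file when the resource starts
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

SOURCE="${OCF_RESKEY_source}"
TARGET="${OCF_RESKEY_target}"
STATE="${HA_RSCTMP}/copyfile-${OCF_RESOURCE_INSTANCE}.state"

copy_start() {
    # keep a backup of the old target before overwriting it
    [ -f "$TARGET" ] && cp -p "$TARGET" "$TARGET.prev"
    cp -p "$SOURCE" "$TARGET" || return $OCF_ERR_GENERIC
    touch "$STATE"
    return $OCF_SUCCESS
}

copy_stop() {
    rm -f "$STATE"
    return $OCF_SUCCESS
}

copy_monitor() {
    # "running" here simply means start has been performed on this node
    [ -f "$STATE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

case "$__OCF_ACTION" in
    start)     copy_start ;;
    stop)      copy_stop ;;
    monitor)   copy_monitor ;;
    meta-data) exit $OCF_SUCCESS ;;   # a real agent must print its XML here
    *)         exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?

Ordered after the DRBD promote and before the database resource with standard
order/colocation constraints, as Ken describes, this would run the copy at the
right point in the sequence.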

2018-03-08 20:12 GMT+01:00 Ken Gaillot :

> On Thu, 2018-03-08 at 18:49 +0100, Mevo Govo wrote:
> > Hi,
> > thanks for the advice and your interest.
> > We would use oracle database over DRBD. Datafiles (and control and
> > redo files) will be on DRBD. FRA also (on an other DRBD device). But
> > we are new to DRBD, and DRBD is also a component that can fail. We
> > plan a scenario to recover the database without DRBD (without data
> > loss, if possible). We would use nfs or local filesystem for this. If
> > we use local FS, the control file is out of sync on the B node when
> > switch over (from A to B). We would copy controlfile (and redo files)
> > from DRBD to the local FS. After this, oracle can start, and it keeps
> > the controlfiles syncronized. If other backup related files (archlog,
> > backup) are also available on the local FS of either node, we can
> > recover the DB without DRBD (without data loss)
> > (I know it is a worst case scenario, because if DRBD fails, the FS on
> > it should be available at least on one node)
> > Thanks: lados.
>
> Why not use native database replication instead of copying files?
>
> Any method getting files from a DRBD cluster to a non-DRBD node will
> have some inherent problems: it would have to be periodic, losing some
> data since the last run; it would still fail if some DRBD issue
> corrupted the on-disk data, because you would be copying the corrupted
> data; and databases generally have in-memory state information that
> makes files copied from a live server insufficient for data integrity.
>
> Native replication would avoid all that.
>
> > 2018-03-07 10:20 GMT+01:00 Klaus Wenninger :
> > > On 03/07/2018 10:03 AM, Mevo Govo wrote:
> > > > Thanks for advices, I will try!
> > > > lados.
> > > >
> > > > 2018-03-05 23:29 GMT+01:00 Ken Gaillot  > > > >:
> > > >
> > > > On Mon, 2018-03-05 at 15:09 +0100, Mevo Govo wrote:
> > > > > Hi,
> > > > > I am new to pacemaker. I think I should use DRBD instead of copying
> > > > > the file. But in this case, I would copy a file from a DRBD to an
> > > > > external device. Is there a builtin way to copy a file before a
> > > > > resource is started (and after the DRBD is promoted)? For example, a
> > > > > "copy" resource? I did not find it.
> > > > > Thanks: lados.
> > > > >
> > > >
> > > > There's no stock way of doing that, but you could easily write an
> > > > agent that simply copies a file. You could use ocf:pacemaker:Dummy as a
> > > > template, and add the copy to the start action. You can use standard
> > > > ordering and colocation constraints to make sure everything happens in
> > > > the right sequence.
> > > >
> > > > I don't know what capabilities your external device has, but another
> > > > approach would be to use an NFS server to share the DRBD file system,
> > > > and mount it from the device, if you want direct access to the
> > > > original file rather than a copy.
> > > >
> > >
> > > csync2 & rsync might be considered as well although not knowing
> > > your scenario in detail it is hard to tell if it would be overkill.
> > >
> > > Regards,
> > > Klaus
> > >
> > > > --
> > > > Ken Gaillot 
> > > >
> > > >