Re: [ClusterLabs] (no subject)

2017-05-24 Thread Digimer
On 24/05/17 04:36 PM, Christopher Pax wrote:
> TO ADMIN: I am going to resubmit this question. please delete this thread.
> 
> thanks

We don't delete messages (and couldn't really if we wanted to anyway,
given it is email based). I am sure responders will reply to the new thread.

cheers,

madi



-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] resource monitor logging

2017-05-24 Thread Christopher Pax
I am running PostgreSQL as a resource under Corosync/Pacemaker, and a monitor
operation kicks off every few seconds to check whether PostgreSQL is alive (it
runs a "select now()"). My immediate concern is that it is generating a lot of
entries in auth.log. Is this normal behavior, and is there a way to silence it?

##
## /var/log/auth.log
##
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres

##
## /var/log/postgresql/data.log
##
DEBUG:  forked new backend, pid=27900 socket=11
LOG:  connection received: host=[local]
LOG:  connection authorized: user=postgres database=template1
LOG:  statement: select now();
LOG:  disconnection: session time: 0:00:00.003 user=postgres
database=template1 host=[local]
DEBUG:  server process (PID 27900) exited with exit code 0
DEBUG:  forked new backend, pid=28030 socket=11
LOG:  connection received: host=[local]
LOG:  connection authorized: user=postgres database=template1
LOG:  statement: select now();
LOG:  disconnection: session time: 0:00:00.002 user=postgres
database=template1 host=[local]


## snippet of the pgsql primitive
primitive res_pgsql_2 pgsql \
params pgdata="/mnt/drbd/postgres"
config="/mnt/drbd/postgres/postgresql.conf"
start_opt="-d 2" pglibs="/usr/lib/postgresql/9.5/lib"
logfile="/var/log/postgresql/data.log" \
operations $id=res_pgsql_1-operations \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60 \
op monitor interval=3 timeout=60 start-delay=0
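
The auth.log entries come from the resource agent invoking runuser to execute
the check as the postgres user, which opens and closes a PAM session on every
monitor run. A common way to quiet this at the syslog level is a sketch like
the following (the file path and filename are assumptions; any .conf under
/etc/rsyslog.d is picked up on most distributions):

```
# /etc/rsyslog.d/10-drop-runuser.conf  (filename is arbitrary)
# Discard the pam_unix session open/close noise generated by runuser
:msg, contains, "pam_unix(runuser:session)" stop
```

followed by restarting rsyslog. Note this silences the message for all runuser
invocations on the host, not just the pgsql monitor, so weigh that against your
audit requirements.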


Re: [ClusterLabs] (no subject)

2017-05-24 Thread Christopher Pax
TO ADMIN: I am going to resubmit this question. Please delete this thread.

thanks



Re: [ClusterLabs] (no subject)

2017-05-24 Thread Christopher Pax
In reference to my previous email (which I accidentally sent without details):

I am running PostgreSQL as a resource under Corosync/Pacemaker, and a monitor
process that kicks off every few seconds to check whether PostgreSQL is alive
(it runs a "select now()"). My immediate concern is that it is generating a
lot of entries in auth.log, and I am wondering if this is normal behavior? Is
there a way to silence this?

Also, here is the primitive snippet:
primitive res_pgsql_2 pgsql \
params pgdata="/mnt/drbd/postgres"
config="/mnt/drbd/postgres/postgresql.conf" start_opt="-d 2"
pglibs="/usr/lib/postgresql/9.5/lib" logfile="/var/log/postgresql/data.log"
\
operations $id=res_pgsql_1-operations \
op start interval=0 timeout=60 \
op stop interval=0 timeout=60 \
op monitor interval=3 timeout=60 start-delay=0






[ClusterLabs] (no subject)

2017-05-24 Thread Christopher Pax
##
## /var/log/auth.log
##
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
opened for user postgres by (uid=0)
May 24 15:23:19 ssinode02-g2 runuser: pam_unix(runuser:session): session
closed for user postgres

##
## /var/log/postgresql/data.log
##
DEBUG:  forked new backend, pid=27900 socket=11
LOG:  connection received: host=[local]
LOG:  connection authorized: user=postgres database=template1
LOG:  statement: select now();
LOG:  disconnection: session time: 0:00:00.003 user=postgres
database=template1 host=[local]
DEBUG:  server process (PID 27900) exited with exit code 0
DEBUG:  forked new backend, pid=28030 socket=11
LOG:  connection received: host=[local]
LOG:  connection authorized: user=postgres database=template1
LOG:  statement: select now();
LOG:  disconnection: session time: 0:00:00.002 user=postgres
database=template1 host=[local]


[ClusterLabs] ClusterIP won't return to recovered node

2017-05-24 Thread Dan Ragle
I suspect this has been asked before and apologize if so; a Google search
didn't turn up anything that was helpful to me ...


I'm setting up an active/active two-node cluster and am having an issue
where one of my two defined cluster IPs will not return to the other node
after that node has been recovered.


I'm running on CentOS 7.3. My resource setups look like this:

# cibadmin -Q|grep dc-version
value="1.1.15-11.el7_3.4-e174ec8"/>


# pcs resource show PublicIP-clone
 Clone: PublicIP-clone
  Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true 
interleave=true

  Resource: PublicIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=75.144.71.38 cidr_netmask=24 nic=bond0
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (PublicIP-start-interval-0s)
   stop interval=0s timeout=20s (PublicIP-stop-interval-0s)
   monitor interval=30s (PublicIP-monitor-interval-30s)

# pcs resource show PrivateIP-clone
 Clone: PrivateIP-clone
  Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true 
interleave=true

  Resource: PrivateIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.1.3 nic=bond1 cidr_netmask=24
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (PrivateIP-start-interval-0s)
   stop interval=0s timeout=20s (PrivateIP-stop-interval-0s)
   monitor interval=10s timeout=20s 
(PrivateIP-monitor-interval-10s)


# pcs constraint --full | grep -i publicip
  start WEB-clone then start PublicIP-clone (kind:Mandatory) 
(id:order-WEB-clone-PublicIP-clone-mandatory)

# pcs constraint --full | grep -i privateip
  start WEB-clone then start PrivateIP-clone (kind:Mandatory) 
(id:order-WEB-clone-PrivateIP-clone-mandatory)


When I first create the resources, they split across the two nodes as 
expected/desired:


 Clone Set: PublicIP-clone [PublicIP] (unique)
 PublicIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PublicIP:1(ocf::heartbeat:IPaddr2):   Started node2-pcs
 Clone Set: PrivateIP-clone [PrivateIP] (unique)
 PrivateIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PrivateIP:1(ocf::heartbeat:IPaddr2):   Started node2-pcs
 Clone Set: WEB-clone [WEB]
 Started: [ node1-pcs node2-pcs ]

I then put the second node in standby:

# pcs node standby node2-pcs

And the IPs both jump to node1 as expected:

 Clone Set: PublicIP-clone [PublicIP] (unique)
 PublicIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PublicIP:1(ocf::heartbeat:IPaddr2):   Started node1-pcs
 Clone Set: WEB-clone [WEB]
 Started: [ node1-pcs ]
 Stopped: [ node2-pcs ]
 Clone Set: PrivateIP-clone [PrivateIP] (unique)
 PrivateIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PrivateIP:1(ocf::heartbeat:IPaddr2):   Started node1-pcs

Then unstandby the second node:

# pcs node unstandby node2-pcs

The publicIP goes back, but the private does not:

 Clone Set: PublicIP-clone [PublicIP] (unique)
 PublicIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PublicIP:1(ocf::heartbeat:IPaddr2):   Started node2-pcs
 Clone Set: WEB-clone [WEB]
 Started: [ node1-pcs node2-pcs ]
 Clone Set: PrivateIP-clone [PrivateIP] (unique)
 PrivateIP:0(ocf::heartbeat:IPaddr2):   Started node1-pcs
 PrivateIP:1(ocf::heartbeat:IPaddr2):   Started node1-pcs

Anybody see what I'm doing wrong? I'm not seeing anything in the logs to 
indicate that it tries node2 and then fails; but I'm fairly new to the 
software so it's possible I'm not looking in the right place.


Also, I noticed that when putting a node in standby, the main NIC appears to
be interrupted momentarily (long enough for my SSH session, which is
connected via the permanent IP on the NIC and not the ClusterIP, to be
dropped). Is there any way to avoid this? I was thinking that the
cluster operations would only affect the ClusterIP and not the other IPs
being served on that NIC.
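
When debugging this kind of "instance won't move back" question, the
scheduler's allocation scores usually show why an instance stays where it is.
A quick way to dump them against the live cluster (resource name taken from
the config above; the exact output format varies by Pacemaker version):

```shell
# Show live allocation scores for the clone instances
crm_simulate -sL | grep -i PrivateIP
```

If PrivateIP:1 scores negative or tied on node2, that points at placement
scoring rather than a failed start on node2.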


Thanks!

Dan




Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-24 Thread Attila Megyeri
Hi Klaus,

Thank you for your response.
I tried many things, but no luck.

We have many pacemaker clusters with 99% identical configurations and package 
versions, and only this one causes issues. (BTW, we use unicast for corosync, 
but that is the same on our other clusters as well.)
I checked all connection settings between the nodes (to confirm there are no 
firewall issues) and increased the number of cores on each node, but still, as 
long as a monitor operation is pending for one resource, no other operation is 
executed.

E.g. resource A is being monitored with a 90-second timeout; until that check 
times out, I cannot do a cleanup or a start/stop on any other resource.

Two more interesting things: 
- The cluster recheck interval is set to 2 minutes, and even though the 
resources are running properly, the fail counters are not reduced and crm_mon 
lists the resources in the failed actions section forever, or until I manually 
do a resource cleanup.
- If I execute "crm resource cleanup RES_name" from another node, sometimes it 
simply does not clean up the failed state. If I execute it from the node where 
the resource actually IS running, the resource is removed from the failed 
actions.


What do you recommend; how could I start troubleshooting these issues? As I 
said, this setup works fine on several other systems, but here I am really, 
really stuck.


thanks!

Attila
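
Klaus's reply below mentions crmd throttling and the cluster-recheck-interval
as two things to rule out. Both can be checked with little risk (the log paths
are assumptions and vary by distribution):

```shell
# 1) Look for throttling messages around the time of a stall
grep -i throttle /var/log/pacemaker.log /var/log/syslog 2>/dev/null

# 2) Temporarily shorten the recheck interval; if the delays shrink to match,
#    pending transitions are only being resolved on the recheck timer
crm configure property cluster-recheck-interval=1min
```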





> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, May 10, 2017 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
> 
> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> >
> > Actually I found some more details:
> >
> >
> >
> > there are two resources: A and B
> >
> >
> >
> > resource B depends on resource A (when the RA monitors B, if will fail
> > if A is not running properly)
> >
> >
> >
> > If I stop resource A, the next monitor operation of "B" will fail.
> > Interestingly, this check happens immediately after A is stopped.
> >
> >
> >
> > B is configured to restart if monitor fails. Start timeout is rather
> > long, 180 seconds. So pacemaker tries to restart B, and waits.
> >
> >
> >
> > If I want to start "A", nothing happens until the start operation of
> > "B" fails - typically several minutes.
> >
> >
> >
> >
> >
> > Is this the right behavior?
> >
> > It appears that pacemaker is blocked until resource B is being
> > started, and I cannot really start its dependency...
> >
> > Shouldn't it be possible to start a resource while another resource is
> > also starting?
> >
> 
> As long as resources don't depend on each other parallel starting should
> work/happen.
> 
> The number of parallel actions executed is derived from the number of
> cores and
> when load is detected some kind of throttling kicks in (in fact reduction of
> the operations executed in parallel with the aim to reduce the load induced
> by pacemaker). When throttling kicks in you should get log messages (there
> is in fact a parallel discussion going on ...).
> No idea if throttling might be a reason here but maybe worth considering
> at least.
> 
> Another reason why certain things happen with quite some delay I've
> observed
> is that obviously some situations are just resolved when the
> cluster-recheck-interval
> triggers a pengine run in addition to those triggered by changes.
> You might easily verify this by changing the cluster-recheck-interval.
> 
> Regards,
> Klaus
> 
> >
> >
> >
> >
> > Thanks,
> >
> > Attila
> >
> >
> >
> >
> >
> > *From:*Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > *Sent:* Tuesday, May 9, 2017 9:53 PM
> > *To:* users@clusterlabs.org; kgail...@redhat.com
> > *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > Hi Ken, all,
> >
> >
> >
> >
> >
> > We ran into an issue very similar to the one described in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 /  [Intel 7.4 Bug]
> > Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > But  in our case we are not using fencing/stonith at all.
> >
> >
> >
> > Many times when I want to start/stop/cleanup a resource, it takes tens
> > of seconds (or even minutes) till the command gets executed. The logs
> > show nothing in that period, the redundant rings show no fault.
> >
> >
> >
> > Could this be the same issue?
> >
> >
> >
> > Any hints on how to troubleshoot this?
> >
> > It is  pacemaker 1.1.10, corosync 2.3.3
> >
> >
> >
> >
> >
> > Cheers,
> >
> > Attila
> >
> >
> >
> >
> >
> >
> >
> >
> >
> 
> 
> --
> Klaus Wenninger
> 
> Senior Software Engineer, EMEA ENG Openstack Infrastructure
> 
> Red Hat
> 
> 

[ClusterLabs] Node attribute disappears when pacemaker is started

2017-05-24 Thread 井上 和徳
Hi,

After loading a node attribute, when I start pacemaker on that node, the 
attribute disappears.

1. Start pacemaker on node1.
2. Load a configuration containing a node attribute for node2.
   (I use multicast addresses in corosync, so I did not set "nodelist { nodeid: }" 
in corosync.conf.)
3. Start pacemaker on node2; the node attribute that was loaded 
disappears.
   Is this the intended behavior?

1.
[root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync

2.
[root@rhel73-1 ~]# cat rhel73-2.crm
node rhel73-2 \
  utilization capacity="2" \
  attributes attrname="attr2"

[root@rhel73-1 ~]# crm configure load update rhel73-2.crm
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
node rhel73-2 \
  utilization capacity=2 \
  attributes attrname=attr2
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync

3.
[root@rhel73-1 ~]# ssh rhel73-2 'systemctl start corosync;systemctl start 
pacemaker'
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
node 3232261508: rhel73-2
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync
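
One workaround worth trying (an assumption on my part, not verified against
this exact version): give the nodes fixed names and IDs in a corosync
nodelist, so the "rhel73-2" entry loaded in step 2 can be matched to the
joining node instead of being replaced by a new numeric-ID entry:

```
# corosync.conf fragment (names and ids are examples)
nodelist {
    node {
        ring0_addr: rhel73-1
        nodeid: 1
    }
    node {
        ring0_addr: rhel73-2
        nodeid: 2
    }
}
```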

Regards,
Kazunori INOUE



Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-24 Thread Nikhil Utane
Thanks Aswathi.

(My account had stopped working due to mail bounces; I have never seen that
happen with gmail accounts.)

Ken,

Answers to your questions are below:

*1. Using force option*
A) During our testing we observed that in some instances the resource
deletion would fail, which is why we added the force option. With the
force option we never saw the problem again.
*2. "Maybe in this particular instance, you actually did "crm_resource
-C"?"*
A) This step is done through code so there is no human involvement. We are
printing the full command and we always see resource name is included. So
this cannot happen.

*3.  $ crm_node -R 0005B94238BC --force*
A) Yes, we want to remove the node completely. We are not specifying the
node information in corosync.conf, so there is nothing to be removed there.
I need to go back and check, but I vaguely remember that because of some
issue we switched from "pcs cluster node remove" to the
"crm_node -R" command. Perhaps because it gave us the option to use force.

*4. "No STONITH and QUORUM"*
A) As I have mentioned earlier, split-brain doesn't pose a problem for us
since we have a second line of defense based on our architecture. Hence we
have made a conscious decision to disable it. The config IS for production.

BTW, we also issue a "pcs resource disable" command before doing a "pcs
resource delete". Not sure if that makes any difference.

We will play around with those 4-5 commands that we execute to see whether
the resource restart happens as a reaction to any one of them.
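
Putting Ken's suggestions together, a removal sequence along these lines may
avoid the restarts (resource and node names taken from this thread; the
ordering is a sketch, not a verified recipe):

```shell
# On any node: stop the resource cleanly before deleting it
pcs resource disable res3
pcs resource delete res3

# On the node being removed: leave the cluster
pcs cluster stop

# On a remaining node: remove the node from both pacemaker's and corosync's
# configuration, instead of the lower-level "crm_node -R <node> --force"
pcs cluster node remove 0005B94238BC
```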

-Thanks & Regards
Nikhil

On Wed, May 24, 2017 at 11:28 AM, Anu Pillai 
wrote:

> blank response for thread to appear in mailbox..pls ignore
>
> On Tue, May 23, 2017 at 4:21 AM, Ken Gaillot  wrote:
>
>> On 05/16/2017 04:34 AM, Anu Pillai wrote:
>> > Hi,
>> >
>> > Please find attached debug logs for the stated problem as well as
>> > crm_mon command outputs.
>> > In this case we are trying to remove/delete res3 and system/node
>> > (0005B94238BC) from the cluster.
>> >
>> > *_Test reproduction steps_*
>> >
>> > Current Configuration of the cluster:
>> >  0005B9423910  - res2
>> >  0005B9427C5A - res1
>> >  0005B94238BC - res3
>> >
>> > *crm_mon output:*
>> >
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with
>> quorum
>> > Last updated: Tue May 16 12:21:23 2017  Last change: Tue May 16
>> > 12:13:40 2017 by root via crm_attribute on 0005B9423910
>> >
>> > 3 nodes and 3 resources configured
>> >
>> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>> >
>> >  res2   (ocf::redundancy:RedundancyRA): Started 0005B9423910
>> >  res1   (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
>> >  res3   (ocf::redundancy:RedundancyRA): Started 0005B94238BC
>> >
>> >
>> > Trigger the delete operation for res3 and node 0005B94238BC.
>> >
>> > Following commands applied from node 0005B94238BC
>> > $ pcs resource delete res3 --force
>> > $ crm_resource -C res3
>> > $ pcs cluster stop --force
>>
>> I don't think "pcs resource delete" or "pcs cluster stop" does anything
>> with the --force option. In any case, --force shouldn't be needed here.
>>
>> The crm_mon output you see is actually not what it appears. It starts
>> with:
>>
>> May 16 12:21:27 [4661] 0005B9423910   crmd:   notice: do_lrm_invoke:
>>Forcing the status of all resources to be redetected
>>
>> This is usually the result of a "cleanup all" command. It works by
>> erasing the resource history, causing pacemaker to re-probe all nodes to
>> get the current state. The history erasure makes it appear to crm_mon
>> that the resources are stopped, but they actually are not.
>>
>> In this case, I'm not sure why it's doing a "cleanup all", since you
>> only asked it to cleanup res3. Maybe in this particular instance, you
>> actually did "crm_resource -C"?
>>
>> > Following command applied from DC(0005B9423910)
>> > $ crm_node -R 0005B94238BC --force
>>
>> This can cause problems. This command shouldn't be run unless the node
>> is removed from both pacemaker's and corosync's configuration. If you
>> actually are trying to remove the node completely, a better alternative
>> would be "pcs cluster node remove 0005B94238BC", which will handle all
>> of that for you. If you're not trying to remove the node completely,
>> then you shouldn't need this command at all.
>>
>> >
>> >
>> > *crm_mon output:*
>> > *
>> > *
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with
>> quorum
>> > Last updated: Tue May 16 12:21:27 2017  Last change: Tue May 16
>> > 12:21:26 2017 by root via cibadmin on 0005B94238BC
>> >
>> > 3 nodes and 2 resources configured
>> >
>> > Online: [ 

Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-24 Thread Anu Pillai
blank response for thread to appear in mailbox..pls ignore

On Tue, May 23, 2017 at 4:21 AM, Ken Gaillot  wrote:

> On 05/16/2017 04:34 AM, Anu Pillai wrote:
> > Hi,
> >
> > Please find attached debug logs for the stated problem as well as
> > crm_mon command outputs.
> > In this case we are trying to remove/delete res3 and system/node
> > (0005B94238BC) from the cluster.
> >
> > *_Test reproduction steps_*
> >
> > Current Configuration of the cluster:
> >  0005B9423910  - res2
> >  0005B9427C5A - res1
> >  0005B94238BC - res3
> >
> > *crm_mon output:*
> >
> > Defaulting to one-shot mode
> > You need to have curses available at compile time to enable console mode
> > Stack: corosync
> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> > Last updated: Tue May 16 12:21:23 2017  Last change: Tue May 16
> > 12:13:40 2017 by root via crm_attribute on 0005B9423910
> >
> > 3 nodes and 3 resources configured
> >
> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> >
> >  res2   (ocf::redundancy:RedundancyRA): Started 0005B9423910
> >  res1   (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
> >  res3   (ocf::redundancy:RedundancyRA): Started 0005B94238BC
> >
> >
> > Trigger the delete operation for res3 and node 0005B94238BC.
> >
> > Following commands applied from node 0005B94238BC
> > $ pcs resource delete res3 --force
> > $ crm_resource -C res3
> > $ pcs cluster stop --force
>
> I don't think "pcs resource delete" or "pcs cluster stop" does anything
> with the --force option. In any case, --force shouldn't be needed here.
>
> The crm_mon output you see is actually not what it appears. It starts with:
>
> May 16 12:21:27 [4661] 0005B9423910   crmd:   notice: do_lrm_invoke:
>Forcing the status of all resources to be redetected
>
> This is usually the result of a "cleanup all" command. It works by
> erasing the resource history, causing pacemaker to re-probe all nodes to
> get the current state. The history erasure makes it appear to crm_mon
> that the resources are stopped, but they actually are not.
>
> In this case, I'm not sure why it's doing a "cleanup all", since you
> only asked it to cleanup res3. Maybe in this particular instance, you
> actually did "crm_resource -C"?
>
> > Following command applied from DC(0005B9423910)
> > $ crm_node -R 0005B94238BC --force
>
> This can cause problems. This command shouldn't be run unless the node
> is removed from both pacemaker's and corosync's configuration. If you
> actually are trying to remove the node completely, a better alternative
> would be "pcs cluster node remove 0005B94238BC", which will handle all
> of that for you. If you're not trying to remove the node completely,
> then you shouldn't need this command at all.
>
> >
> >
> > *crm_mon output:*
> > *
> > *
> > Defaulting to one-shot mode
> > You need to have curses available at compile time to enable console mode
> > Stack: corosync
> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> > Last updated: Tue May 16 12:21:27 2017  Last change: Tue May 16
> > 12:21:26 2017 by root via cibadmin on 0005B94238BC
> >
> > 3 nodes and 2 resources configured
> >
> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> >
> >
> > Observation is remaining two resources res2 and res1 were stopped and
> > started.
> >
> >
> > Regards,
> > Aswathi
> >
> > On Mon, May 15, 2017 at 8:11 PM, Ken Gaillot  > > wrote:
> >
> > On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
> > > On 05/15/2017 12:25 PM, Anu Pillai wrote:
> > >> Hi Klaus,
> > >>
> > >> Please find attached cib.xml as well as corosync.conf.
> >
> > Maybe you're only setting this while testing, but having
> > stonith-enabled=false and no-quorum-policy=ignore is highly
> dangerous in
> > any kind of network split.
> >
> > FYI, default-action-timeout is deprecated in favor of setting a
> timeout
> > in op_defaults, but it doesn't hurt anything.
> >
> > > Why wouldn't you keep placement-strategy with default
> > > to keep things simple. You aren't using any load-balancing
> > > anyway as far as I understood it.
> >
> > It looks like the intent is to use placement-strategy to limit each
> node
> > to 1 resource. The configuration looks good for that.
> >
> > > Haven't used resource-stickiness=INF. No idea which strange
> > > behavior that triggers. Try to have it just higher than what
> > > the other scores might some up to.
> >
> > Either way would be fine. Using INFINITY ensures that no other
> > combination of scores will override it.
> >
> > > I might have overseen something in your scores but otherwise
> > > there is nothing obvious to me.
> > >
> > > Regards,
> > > Klaus
> >
> > I don't see anything obvious either. If you have logs around the
> time of
> > the incident, that might help.
> >
> > >>