Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Jonathan Davies



On 12/10/17 07:48, Jan Friesse wrote:

Jonathan,
I believe the main "problem" is votequorum's ability to work during the sync 
phase (votequorum is the only service with this ability; see 
votequorum_overview.8, section VIRTUAL SYNCHRONY)...



Hi ClusterLabs,

I'm seeing a race condition in corosync where votequorum can have
incorrect membership info when a node joins the cluster then leaves very
soon after.

I'm on corosync-2.3.4 plus my patch
https://github.com/corosync/corosync/pull/248. That patch makes the
problem readily reproducible but the bug was already present.

Here's the scenario. I have two hosts, cluster1 and cluster2. The
corosync.conf on cluster2 is:

 totem {
   version: 2
   cluster_name: test
   config_version: 2
   transport: udpu
 }
 nodelist {
   node {
 nodeid: 1
 ring0_addr: cluster1
   }
   node {
 nodeid: 2
 ring0_addr: cluster2
   }
 }
 quorum {
   provider: corosync_votequorum
   auto_tie_breaker: 1
 }
 logging {
   to_syslog: yes
 }

The corosync.conf on cluster1 is the same except with "config_version: 1".


I start corosync on cluster2. When I start corosync on cluster1, it
joins and then immediately leaves due to the lower config_version.
(Previously corosync on cluster2 would also exit but with
https://github.com/corosync/corosync/pull/248 it remains alive.)
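As an aside, the config_version that a running node is actually using can be inspected through the cmap database. This is a sketch, assuming the corosync 2.x key naming (totem.config_version) and the corosync-cmapctl `-g` (get key) option:

```sh
# Print the live totem.config_version key on this node
corosync-cmapctl -g totem.config_version
```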

But often at this point, cluster1's disappearance is not reflected in
the votequorum info on cluster2:


... Is this permanent (i.e. until a new node joins/leaves), or will it fix 
itself over a (short) time? If it's permanent, it's a bug. If it fixes 
itself, it's a result of votequorum not being virtually synchronous.


Yes, it's permanent. After several minutes of waiting, votequorum still 
reports "Total votes: 2" even though there's only one member.


Thanks,
Jonathan



 Quorum information
 ------------------
 Date:             Tue Oct 10 16:43:50 2017
 Quorum provider:  corosync_votequorum
 Nodes:            1
 Node ID:          2
 Ring ID:          700
 Quorate:          Yes

 Votequorum information
 ----------------------
 Expected votes:   2
 Highest expected: 2
 Total votes:      2
 Quorum:           2
 Flags:            Quorate AutoTieBreaker

 Membership information
 ----------------------
     Nodeid      Votes Name
          2          1 cluster2 (local)

The logs on cluster1 show:

 Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config
version (2) is different than my config version (1)! Exiting

The logs on cluster2 show:

 Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
(10.71.218.17:588) was formed. Members joined: 1
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is
within the primary component and will provide service.
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
 Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
(10.71.218.18:592) was formed. Members left: 1
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
 Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed
service synchronization, ready to provide service.

It looks like QUORUM has seen cluster1's arrival but not its departure!

When it works as expected, the state is left consistent:

 Quorum information
 ------------------
 Date:             Tue Oct 10 16:58:14 2017
 Quorum provider:  corosync_votequorum
 Nodes:            1
 Node ID:          2
 Ring ID:          604
 Quorate:          No

 Votequorum information
 ----------------------
 Expected votes:   2
 Highest expected: 2
 Total votes:      1
 Quorum:           2 Activity blocked
 Flags:            AutoTieBreaker

 Membership information
 ----------------------
     Nodeid      Votes Name
          2          1 cluster2 (local)

Logs on cluster1:

 Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config
version (2) is different than my config version (1)! Exiting

Logs on cluster2 are either:

 Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
membership (10.71.218.17:600) was formed. Members joined: 1
 Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
within the primary component and will provide service.
 Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
 Oct 10 16:58:01 cluster2 corosync[17835]:  [CMAP  ] Highest config
version (2) and my config version (2)
 Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
membership (10.71.218.18:604) was formed. Members left: 1
 Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
within the non-primary component and will NOT provide any services.
 Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
 Oct 10 16:58:01 cluster2 corosync[17835]:  [MAIN

Re: [ClusterLabs] trouble with IPaddr2

2017-10-12 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 02:36:24PM +0200, Valentin Vidic wrote:
> AFAICT, it found a better interface with that subnet and tried
> to use it instead of the one specified in the parameters :)
> 
> But maybe IPaddr2 should just skip interface auto-detection
> if an explicit interface was given in the parameters?

Oh, it seems you specified nic only for the monitor operation, so
it would fall back to auto-detection for the start and stop actions:

primitive HA_IP-Serv1 IPaddr2 \
params ip=172.16.101.70 cidr_netmask=16 \
op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
meta target-role=Started

So you probably wanted this instead:

primitive HA_IP-Serv1 IPaddr2 \
params ip=172.16.101.70 cidr_netmask=16 nic=bond0 \
op monitor interval=20 timeout=30 on-fail=restart \
meta target-role=Started

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Jan Friesse

Jonathan,




On 12/10/17 07:48, Jan Friesse wrote:

Jonathan,
I believe the main "problem" is votequorum's ability to work during the sync
phase (votequorum is the only service with this ability; see
votequorum_overview.8, section VIRTUAL SYNCHRONY)...


Hi ClusterLabs,

I'm seeing a race condition in corosync where votequorum can have
incorrect membership info when a node joins the cluster then leaves very
soon after.

I'm on corosync-2.3.4 plus my patch


Finally noticed ^^^ 2.3.4 is really old and, unless it is a patched 
version, I wouldn't recommend using it. Can you give current needle a 
try?



https://github.com/corosync/corosync/pull/248. That patch makes the
problem readily reproducible but the bug was already present.

Here's the scenario. I have two hosts, cluster1 and cluster2. The
corosync.conf on cluster2 is:

 totem {
   version: 2
   cluster_name: test
   config_version: 2
   transport: udpu
 }
 nodelist {
   node {
 nodeid: 1
 ring0_addr: cluster1
   }
   node {
 nodeid: 2
 ring0_addr: cluster2
   }
 }
 quorum {
   provider: corosync_votequorum
   auto_tie_breaker: 1
 }
 logging {
   to_syslog: yes
 }

The corosync.conf on cluster1 is the same except with
"config_version: 1".

I start corosync on cluster2. When I start corosync on cluster1, it
joins and then immediately leaves due to the lower config_version.
(Previously corosync on cluster2 would also exit but with
https://github.com/corosync/corosync/pull/248 it remains alive.)

But often at this point, cluster1's disappearance is not reflected in
the votequorum info on cluster2:


... Is this permanent (i.e. until a new node joins/leaves), or will it fix
itself over a (short) time? If it's permanent, it's a bug. If it fixes
itself, it's a result of votequorum not being virtually synchronous.


Yes, it's permanent. After several minutes of waiting, votequorum still
reports "total votes: 2" even though there's only one member.



That's bad. I've tried the following setup:

- Both nodes with current needle
- Your config
- Second node just running corosync
- First node running the following command:
  while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done


I've been running it for quite a while and I'm unable to reproduce the bug. 
Sadly, I'm unable to reproduce it even with 2.3.4. Do you think the 
reproducer is correct?


Honza




Thanks,
Jonathan



 Quorum information
 ------------------
 Date:             Tue Oct 10 16:43:50 2017
 Quorum provider:  corosync_votequorum
 Nodes:            1
 Node ID:          2
 Ring ID:          700
 Quorate:          Yes

 Votequorum information
 ----------------------
 Expected votes:   2
 Highest expected: 2
 Total votes:      2
 Quorum:           2
 Flags:            Quorate AutoTieBreaker

 Membership information
 ----------------------
     Nodeid      Votes Name
          2          1 cluster2 (local)

The logs on cluster1 show:

 Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config
version (2) is different than my config version (1)! Exiting

The logs on cluster2 show:

 Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
(10.71.218.17:588) was formed. Members joined: 1
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is
within the primary component and will provide service.
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
 Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
(10.71.218.18:592) was formed. Members left: 1
 Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
 Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed
service synchronization, ready to provide service.

It looks like QUORUM has seen cluster1's arrival but not its departure!

When it works as expected, the state is left consistent:

 Quorum information
 ------------------
 Date:             Tue Oct 10 16:58:14 2017
 Quorum provider:  corosync_votequorum
 Nodes:            1
 Node ID:          2
 Ring ID:          604
 Quorate:          No

 Votequorum information
 ----------------------
 Expected votes:   2
 Highest expected: 2
 Total votes:      1
 Quorum:           2 Activity blocked
 Flags:            AutoTieBreaker

 Membership information
 ----------------------
     Nodeid      Votes Name
          2          1 cluster2 (local)

Logs on cluster1:

 Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config
version (2) is different than my config version (1)! Exiting

Logs on cluster2 are either:

 Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
membership (10.71.218.17:600) was formed. Members joined: 1
 Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
wit

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Christine Caulfield
On 12/10/17 11:54, Jan Friesse wrote:
> Jonathan,
> 
>>
>>
>> On 12/10/17 07:48, Jan Friesse wrote:
>>> Jonathan,
>>> I believe the main "problem" is votequorum's ability to work during the sync
>>> phase (votequorum is the only service with this ability; see
>>> votequorum_overview.8, section VIRTUAL SYNCHRONY)...
>>>
 Hi ClusterLabs,

 I'm seeing a race condition in corosync where votequorum can have
 incorrect membership info when a node joins the cluster then leaves
 very
 soon after.

 I'm on corosync-2.3.4 plus my patch
> 
> Finally noticed ^^^ 2.3.4 is really old and, unless it is a patched
> version, I wouldn't recommend using it. Can you give current needle a
> try?
> 
 https://github.com/corosync/corosync/pull/248. That patch makes the
 problem readily reproducible but the bug was already present.

 Here's the scenario. I have two hosts, cluster1 and cluster2. The
 corosync.conf on cluster2 is:

  totem {
    version: 2
    cluster_name: test
    config_version: 2
    transport: udpu
  }
  nodelist {
    node {
  nodeid: 1
  ring0_addr: cluster1
    }
    node {
  nodeid: 2
  ring0_addr: cluster2
    }
  }
  quorum {
    provider: corosync_votequorum
    auto_tie_breaker: 1
  }
  logging {
    to_syslog: yes
  }

 The corosync.conf on cluster1 is the same except with
 "config_version: 1".

 I start corosync on cluster2. When I start corosync on cluster1, it
 joins and then immediately leaves due to the lower config_version.
 (Previously corosync on cluster2 would also exit but with
 https://github.com/corosync/corosync/pull/248 it remains alive.)

 But often at this point, cluster1's disappearance is not reflected in
 the votequorum info on cluster2:
>>>
>>> ... Is this permanent (i.e. until a new node joins/leaves), or will it fix
>>> itself over a (short) time? If it's permanent, it's a bug. If it fixes
>>> itself, it's a result of votequorum not being virtually synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
> 
> 
> That's bad. I've tried the following setup:
> 
> - Both nodes with current needle
> - Your config
> - Second node just running corosync
> - First node running the following command:
>   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done
> 
> I've been running it for quite a while and I'm unable to reproduce the
> bug. Sadly, I'm unable to reproduce it even with 2.3.4. Do you think the
> reproducer is correct?
> 

I can't reproduce it either.

Chrissie



Re: [ClusterLabs] Debugging problems with resource timeout without any actions from cluster

2017-10-12 Thread Ken Gaillot
On Thu, 2017-10-12 at 17:13 +0600, Sergey Korobitsin wrote:
> Hello,
> I experience a strange problem with the MySQL resource agent from
> Percona: sometimes its monitor operation is killed by lrmd due to a
> timeout, like this:
> 
> Oct 12 12:26:46 sde1 lrmd[14812]:  warning: p_mysql_monitor_5000
> process (PID 28991) timed out
> Oct 12 12:27:15 sde1 lrmd[14812]:  warning:
> p_mysql_monitor_5000:28991 - timed out after 2ms
> Oct 12 12:27:15 sde1 crmd[14815]:error: Result of monitor
> operation for p_mysql on sde1: Timed Out
> 
> I'm now investigating the problem, but the trouble is that no
> extraordinary DB load or anything like that was detected. However, when
> those timeouts happen, Pacemaker tries to move MySQL (and all resources
> colocated with it) to the other node (I have a two-node cluster). For
> some reasons I have the other node in standby mode now, so Pacemaker
> moves the resources back, restarting them. All this moving/restarting
> makes our services unavailable for some time, which is unwanted.
> 
> So, my goal is to keep the cluster with MySQL and the other colocated
> resources up, but with resource monitoring only, and without starting,
> stopping, promoting, or demoting resources.
> 
> I found several ways to achieve that:
> 
> 1. Put cluster in maintainance mode (as described here:
>    https://www.hastexo.com/resources/hints-and-kinks/maintenance-acti
> ve-pacemaker-clusters/)
> 
>    As far as I understand, services will be monitored, all logs
>    written, etc., but no action will be taken in case of failures. Is
>    that right?

Actually, maintenance mode stops all monitors (except those with
role=Stopped, which ensure a service is not running).

> 
> 2. Put the particular resource to unmanaged mode, as described here:
>    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemak
> er_Explained/#s-monitoring-unmanaged

Disabling starts and stops is the exact purpose of unmanaged, so this
is one way to get what you want. FYI you can also set this as a global
default for all resources by setting it in the resource defaults
section of the configuration.
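A sketch of that global default, assuming crmsh syntax (the equivalent exists in pcs as resource defaults):

```sh
# Make all resources unmanaged by default: monitors keep running and
# failures are reported, but the cluster will not start/stop anything
crm configure rsc_defaults is-managed=false
```

Individual resources can still override this with their own is-managed meta attribute.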

> 3. Start all resources and remove start and stop operations from
> them.

:-O

> Which is the best way to achieve my purpose? I would like the cluster
> to run as usual (and log as usual, or with trace on the problematic
> resource), but no action should be taken on a monitor failure.

That's actually a different goal, also easily accomplished, by setting
on-fail=ignore on the monitor operation. From the sound of it, this is
closer to what you want, since the cluster is still allowed to
start/stop resources when you standby a node, etc.

You could also delete the recurring monitor operation from the
configuration, and it wouldn't run at all. But keeping it and setting
on-fail=ignore lets you see failures in cluster status.

However, I'm not sure bypassing the monitor is the best solution to
this problem. If the problem is simply that your database monitor can
legitimately take longer than 20 seconds in normal operation, then
raise the timeout as needed.
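Applied to the primitive quoted below, both suggestions could look like this (a sketch; only the monitor operation lines are shown, and the timeout value is illustrative, not from the thread):

```sh
# Monitor ops with an explicit, generous timeout; on-fail=ignore keeps
# failures visible in status without triggering recovery
op monitor interval=5s timeout=60s on-fail=ignore role=Master OCF_CHECK_LEVEL=1
op monitor interval=2s timeout=60s on-fail=ignore role=Slave OCF_CHECK_LEVEL=1
```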

> Here is the configuration of MySQL resource:
> 
> primitive p_mysql ocf:percona:mysql \
> params config="/etc/mysql/my.cnf"
> pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock"
> replication_user=slave_user replication_passwd=password
> max_slave_lag=180 evict_outdated_slaves=false
> binary="/usr/sbin/mysqld" test_user=test test_passwd=test \
> op start interval=0 timeout=60s \
> op stop interval=0 timeout=60s \
> op monitor interval=5s role=Master OCF_CHECK_LEVEL=1 \
> op monitor interval=2s role=Slave OCF_CHECK_LEVEL=1
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] can't move/migrate ressource

2017-10-12 Thread Ken Gaillot
On Wed, 2017-10-11 at 13:58 +0200, Stefan Krueger wrote:
> Hello,
> 
> when I try to migrate a resource from one server to another (for
> example, for maintenance), it doesn't work.
> A single resource works fine; after that I created a group with 2
> resources and tried to move that.
> 
> my config is:
> crm conf show
> node 739272007: zfs-serv1
> node 739272008: zfs-serv2
> primitive HA_IP-Serv1 IPaddr2 \
> params ip=172.16.101.70 cidr_netmask=16 \
> op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
> meta target-role=Started
> primitive HA_IP-Serv2 IPaddr2 \
> params ip=172.16.101.74 cidr_netmask=16 \
> op monitor interval=10s nic=bond0
> primitive nc_storage ZFS \
> params pool=nc_storage importargs="-d /dev/disk/by-
> partlabel/"
> group compl_zfs-serv1 nc_storage HA_IP-Serv1
> location cli-prefer-HA_IP-Serv1 compl_zfs-serv1 role=Started inf:
> zfs-serv1
> location cli-prefer-HA_IP-Serv2 HA_IP-Serv2 role=Started inf: zfs-
> serv2
> location cli-prefer-compl_zfs-serv1 compl_zfs-serv1 role=Started inf:
> zfs-serv2
> location cli-prefer-nc_storage compl_zfs-serv1 role=Started inf: zfs-
> serv1

Oddly, the above constraint applies to compl_zfs-serv1 even though its
name references nc_storage. So, you have two +INFINITY constraints for
compl_zfs-serv1, for both nodes. That's not a problem, but it's almost
certainly not what you want.

If you notice, the four location constraints above start with "cli-".
When you "move" a resource, the tools actually accomplish the move by
creating a new location constraint. If you just keep running "move",
you keep piling on constraints, which might conflict with each other.
It's best to clear the old constraints (the specific command varies by
the tool you're using) before doing a new move.
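With crmsh, clearing those leftover "cli-" constraints before requesting a new move could look like this (a sketch; in older crmsh the subcommand is `unmigrate`/`unmove`, in newer releases `clear`):

```sh
# Drop the cli-prefer-* location constraints left by earlier moves
crm resource unmigrate compl_zfs-serv1
# Then request the move you actually want
crm resource move compl_zfs-serv1 zfs-serv2
```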

> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.16-94ff4df \
> cluster-infrastructure=corosync \
> cluster-name=debian \
> no-quorum-policy=ignore \
> default-resource-stickiness=100 \
> stonith-enabled=false \
> last-lrm-refresh=1507702403
> 
> 
> command:
> crm resource move compl_zfs-serv1 zfs-serv2
> 
> 
> pacemakerlog from zfs-serv2:
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  Diff: --- 0.106.0 2
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  Diff: +++ 0.107.0 cc224b15d0a796e040b026b7c2965770
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  --
> /cib/configuration/constraints/rsc_location[@id='cli-prefer-
> compl_zfs-serv1']
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  +  /cib:  @epoch=107
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_process_request: Completed cib_delete operation for section
> constraints: OK (rc=0, origin=zfs-serv1/crm_resource/3,
> version=0.107.0)
> Oct 11 13:55:58 [3561] zfs-serv2   crmd: info:
> abort_transition_graph:  Transition aborted by deletion of
> rsc_location[@id='cli-prefer-compl_zfs-serv1']: Configuration change
> | cib=0.107.0 source=te_update_diff:444
> path=/cib/configuration/constraints/rsc_location[@id='cli-prefer-
> compl_zfs-serv1'] complete=true
> Oct 11 13:55:58 [3561] zfs-serv2   crmd:   notice:
> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  Diff: --- 0.107.0 2
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  Diff: +++ 0.108.0 (null)
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  +  /cib:  @epoch=108
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_perform_op:  ++ /cib/configuration/constraints:  <rsc_location
> id="cli-prefer-compl_zfs-serv1" rsc="compl_zfs-serv1" role="Started"
> node="zfs-serv2" score="INFINITY"/>
> Oct 11 13:55:58 [3556] zfs-serv2cib: info:
> cib_process_request: Completed cib_modify operation for section
> constraints: OK (rc=0, origin=zfs-serv1/crm_resource/4,
> version=0.108.0)
> Oct 11 13:55:58 [3561] zfs-serv2   crmd: info:
> abort_transition_graph:  Transition aborted by rsc_location.cli-
> prefer-compl_zfs-serv1 'create': Configuration change | cib=0.108.0
> source=te_update_diff:444 path=/cib/configuration/constraints
> complete=true
> Oct 11 13:55:58 [3560] zfs-serv2pengine:   notice:
> unpack_config:   On loss of CCM Quorum: Ignore
> Oct 11 13:55:58 [3560] zfs-serv2pengine: info:
> determine_online_status: Node zfs-serv2 is online
> Oct 11 13:55:58 [3560] zfs-serv2pengine: info:
> determine_online_status: Node zfs-serv1 is online
> Oct 11 13:55:58 [3560] zfs-serv2pengine: info:
> determine_op_status: Operation monitor found resource nc_storage
> active on zfs-serv2
> Oct 11 13:55:58 [3560] zfs-serv2pengine: info:
> native_pr

[ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Attila Megyeri
Hi all,

What is the recommended mysql server upgrade methodology in case of an 
active/passive DRBD storage?
(Ubuntu is the platform)


1)  On the passive node the mysql data directory is not mounted, so the 
backup fails (some postinstall jobs will attempt to perform manipulations on 
certain files in the data directory).

2)  If the upgrade is done on the active node, it will restart the service 
(with a plain service restart, not in a crm-managed fashion...), which is not a 
very good option (downtime in an HA solution). Not to mention that it will 
update some files in the mysql data directory, which can cause strange issues 
if the A/P pair is switched, since on the other node the program code will 
still be the old one while the data dir is already upgraded.

Any hints are welcome!

Thanks,
Attila



Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Kristián Feldsam
Hello, you should put the cluster into maintenance mode.



Sent from my MI 5

On Attila Megyeri, Oct 12, 2017 6:55 PM wrote:

Hi all,

What is the recommended mysql server upgrade methodology in case of an active/passive DRBD storage?
(Ubuntu is the platform)

1)  On the passive node the mysql data directory is not mounted, so the backup fails (some postinstall jobs will attempt to perform manipulations on certain files in the data directory).

2)  If the upgrade is done on the active node, it will restart the service (with a plain service restart, not in a crm-managed fashion...), which is not a very good option (downtime in an HA solution). Not to mention that it will update some files in the mysql data directory, which can cause strange issues if the A/P pair is switched, since on the other node the program code will still be the old one while the data dir is already upgraded.

Any hints are welcome!

Thanks,
Attila


Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Attila Megyeri
Hi, this does not really answer my question. Placing the cluster into 
maintenance mode just avoids monitoring and restarting, but what about the 
things I am asking about below (the data-directory-related questions)?

thanks


From: Kristián Feldsam [mailto:ad...@feldhost.cz]
Sent: Thursday, October 12, 2017 7:51 PM
To: Attila Megyeri ; users@clusterlabs.org
Subject: Re:[ClusterLabs] Mysql upgrade in DRBD setup

hello, you should put cluster to maintenance mode



Sent from my MI 5
On Attila Megyeri (mailto:amegy...@minerva-soft.com), Oct 12, 2017 6:55 PM wrote:
Hi all,

What is the recommended mysql server upgrade methodology in case of an 
active/passive DRBD storage?
(Ubuntu is the platform)


1)  On the passive node the mysql data directory is not mounted, so the 
backup fails (some postinstall jobs will attempt to perform manipulations on 
certain files in the data directory).

2)  If the upgrade is done on the active node, it will restart the service 
(with a plain service restart, not in a crm-managed fashion...), which is not a 
very good option (downtime in an HA solution). Not to mention that it will 
update some files in the mysql data directory, which can cause strange issues 
if the A/P pair is switched, since on the other node the program code will 
still be the old one while the data dir is already upgraded.

Any hints are welcome!

Thanks,
Attila



Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Kristián Feldsam
Hello, I think the only way is to upgrade manually, and it cannot be done 
without downtime. My approach would be to put the cluster into maintenance mode 
and perform the upgrade on the active node with the mysql_upgrade command. On 
the passive node it should be enough to just install the newer mysql without 
starting it or running mysql_upgrade. But you can also make the second node 
primary, mount the fs, and then run a standard mysql upgrade like on the 
active node.
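A rough sketch of that procedure (the package command is an Ubuntu-style assumption, not from the thread; maintenance mode goes first because the package's postinstall step may itself restart mysqld):

```sh
# 1. Stop the cluster from reacting to anything
crm configure property maintenance-mode=true

# 2. On the active node: upgrade the packages, then the data directory
apt-get install mysql-server
mysql_upgrade

# 3. On the passive node: install the new binaries only; the data dir
#    is not mounted here, so do NOT run mysql_upgrade
apt-get install mysql-server

# 4. Hand control back to the cluster
crm configure property maintenance-mode=false
```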

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 12 Oct 2017, at 20:42, Attila Megyeri  wrote:
> 
> Hi, This does not really answer my question. Placing the cluster into 
> maintenance mode just avoids monitoring and restarting, but what about the 
> things I am asking below? (data dir related questions)
>  
> thanks
>  
> From: Kristián Feldsam [mailto:ad...@feldhost.cz] 
> Sent: Thursday, October 12, 2017 7:51 PM
> To: Attila Megyeri ; users@clusterlabs.org
> Subject: Re:[ClusterLabs] Mysql upgrade in DRBD setup
>  
> hello, you should put cluster to maintenance mode
>  
>  
>  
> Sent from my MI 5
> On Attila Megyeri, Oct 12, 2017 6:55 PM wrote:
> Hi all,
>  
> What is the recommended mysql server upgrade methodology in case of an 
> active/passive DRBD storage?
> (Ubuntu is the platform)
>  
> 1)  On the passive node the mysql data directory is not mounted, so the 
> backup fails (some postinstall jobs will attempt to perform manipulations on 
> certain files in the data directory).
> 2)  If the upgrade is done on the active node, it will restart the
> service (with a plain service restart, not in a crm-managed fashion...),
> which is not a very good option (downtime in an HA solution). Not to
> mention that it will update some files in the mysql data directory,
> which can cause strange issues if the A/P pair is switched, since on
> the other node the program code will still be the old one while the
> data dir is already upgraded.
>  
> Any hints are welcome!
>  
> Thanks,
> Attila



Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Ken Gaillot
On Thu, 2017-10-12 at 18:51 +0200, Attila Megyeri wrote:
> Hi all,
>  
> What is the recommended mysql server upgrade methodology in case of
> an active/passive DRBD storage?
> (Ubuntu is the platform)

If you want to minimize downtime in a MySQL upgrade, your best bet is
to use MySQL native replication rather than replicate the storage.

1. starting point: node1 = master, node2 = slave
2. stop mysql on node2, upgrade, start mysql again, ensure OK
3. switch master to node2 and slave to node1, ensure OK
4. stop mysql on node1, upgrade, start mysql again, ensure OK

You might have a small window where the database is read-only while you
switch masters (you can keep it to a few seconds if you arrange things
well), but other than that, you won't have any downtime, even if some
part of the upgrade gives you trouble.
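Step 3 could be sketched roughly as follows. The hostnames are the node1/node2 from the steps above, `slave_user` is borrowed from the resource configuration quoted elsewhere in this digest, and replication coordinates and credentials are deliberately left out, so treat this as an outline rather than a recipe:

```sql
-- On node1 (old master): stop accepting writes
SET GLOBAL read_only = ON;

-- On node2 (old slave), once it has applied everything from node1:
STOP SLAVE;
SET GLOBAL read_only = OFF;   -- node2 is now the writable master

-- On node1: start replicating from node2
CHANGE MASTER TO MASTER_HOST='node2', MASTER_USER='slave_user';
START SLAVE;
```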

>  
> 1)  On the passive node the mysql data directory is not mounted,
> so the backup fails (some postinstall jobs will attempt to perform
> manipulations on certain files in the data directory).
> 2)  If the upgrade is done on the active node, it will restart
> the service (with the service restart, not in a crm managed
> fassion…), which is not a very good option (downtime in a HA
> solution). Not to mention, that it will update some files in the
> mysql data directory, which can cause strange issues if the A/P pair
> is changed – since on the other node the program code will still be
> the old one, while the data dir is already upgraded.
>  
> Any hints are welcome!
>  
> Thanks,
> Attila
>  
-- 
Ken Gaillot 
