On Tue, May 07, 2019 at 09:59:03AM +0300, Klecho wrote:
> During the weekend my corosync daemon suddenly died without anything in the
> logs, except this:
>
> May 5 20:39:16 ZZZ kernel: [1605277.136049] traps: corosync[2811] trap
> invalid opcode ip:5635c376f2eb sp:7ffc3e109950 error:0 in
> coros
On Wed, Apr 03, 2019 at 10:36:52AM +0300, Andrei Borzenkov wrote:
> I assume this is path failover time? As I doubt storage latency can be
> that high?
>
> I wonder, does IBM have official guidelines for integrating SBD with
> their storage? Otherwise where this requirement comes from?
Yes, we ha
On Wed, Apr 03, 2019 at 09:13:58AM +0200, Ulrich Windl wrote:
> I'm surprised: Once sbd writes the fence command, it usually takes
> less than 3 seconds until the victim is dead. If you power off a
> server, the PDU still may have one or two seconds "power reserve", so
> the host may not be down im
On Wed, Mar 27, 2019 at 08:13:07AM +0100, Ulrich Windl wrote:
> Seems to be traditional. Maybe it's just to create some namespace.
Yes, I think GFS2 and OCFS2 store the cluster name in the FS metadata.
If the cluster name changes they might not mount anymore:
https://access.redhat.com/solutions/18430
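For GFS2 the stored name can be checked, and after a cluster rename updated,
with tunegfs2 while the filesystem is unmounted (the device and names below
are just placeholders):

  # show the superblock, including the "clustername:fsname" lock table
  tunegfs2 -l /dev/vg_cluster/lv_gfs2
  # rewrite it to match the new cluster name
  tunegfs2 -o locktable=newcluster:myfs /dev/vg_cluster/lv_gfs2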
--
On Thu, Mar 21, 2019 at 08:00:05AM +0100, Ulrich Windl wrote:
> Actually it makes no difference to a non-clustered local disk: If the buffers
> are not flushed, data can get lost if there is a power failure. If you use
> sync
> writes, the data should be on disk, and I guess with DRBD the data sho
On Wed, Mar 20, 2019 at 07:31:02PM +0100, Valentin Vidic wrote:
> Right, but I'm not sure how this would help in the above situation
> unless the DRBD can undo the local write that did not succeed on the
> peer?
Ah, it seems the activity log handles the undo by storing the
location
On Wed, Mar 20, 2019 at 02:01:07PM -0400, Digimer wrote:
> On 2019-03-20 2:00 p.m., Valentin Vidic wrote:
> > On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> >> Not when DRBD is configured correctly. You sent 'fencing
> >> resource-and-stonith;'
On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> Not when DRBD is configured correctly. You sent 'fencing
> resource-and-stonith;' and set the appropriate fence handler. This tells
> DRBD to not proceed with a write while a node is in an unknown state
> (which happens when the node stops
On Wed, Mar 20, 2019 at 01:44:06PM -0400, Digimer wrote:
> GFS2 notified the peers of disk changes, and DRBD handles actually
> copying the changes to the peer.
>
> Think of DRBD, in this context, as being mdadm RAID, like how writing to
> /dev/md0 is handled behind the scenes to write to both /dev
On Wed, Mar 20, 2019 at 01:34:52PM -0400, Digimer wrote:
> Depending on your fail-over tolerances, I might add NFS to the mix and
> have the NFS server run on one node or the other, exporting your ext4 FS
> that sits on DRBD in single-primary mode.
>
> The failover (if the NFS host died) would loo
On Wed, Mar 20, 2019 at 12:37:21PM -0400, Digimer wrote:
> Cluster filesystems are amazing if you need them, and to be avoided if
> at all possible. The overhead from the cluster locking hurts performance
> quite a lot, and adds a non-trivial layer of complexity.
>
> I say this as someone who
On Wed, Mar 20, 2019 at 09:36:58AM -0600, JCA wrote:
> # pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1"
> directory="/tmp/Testing"
> fstype="ext4"
ext4 can only be mounted on one node at a time. If you need to access
files on both nodes at the same time, then a clu
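If both nodes really do need the mount, the same pcs command can be adapted
roughly like this (GFS2 also needs dlm and dual-primary DRBD, which is left
out here):

  pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" \
      directory="/tmp/Testing" fstype="gfs2" --clone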
On Sat, Feb 16, 2019 at 10:23:17PM +, Eric Robinson wrote:
> I'm looking through the docs but I don't see how to set the on-fail value for
> a resource.
It is not set on the resource itself but on each of the actions (monitor,
start, stop).
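With pcs that looks roughly like this (the resource and values are only
examples):

  pcs resource create p_mysql_001 lsb:mysql_001 \
      op monitor interval=30s on-fail=restart \
      op stop timeout=60s on-fail=block
  # or for an existing resource:
  pcs resource update p_mysql_001 op stop on-fail=block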
--
Valentin
On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> I just noticed that. I also noticed that the lsb init script has a
> hard-coded stop timeout of 30 seconds. So if the init script waits
> longer than the cluster resource timeout of 15s, that would cause the
Yes, you should use highe
On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> Here are the relevant corosync logs.
>
> It appears that the stop action for resource p_mysql_002 failed, and
> that caused a cascading series of service changes. However, I don't
> understand why, since no other resources are depend
On Sat, Feb 16, 2019 at 08:50:57PM +, Eric Robinson wrote:
> Which logs? You mean /var/log/cluster/corosync.log?
On the DC node pacemaker will be logging the actions it is trying
to run (start or stop some resources).
> But even if the stop action is resulting in an error, why would the
> clu
On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> Why is it that when one of the resources that start with p_mysql_*
> goes into a FAILED state, all the other MySQL services also stop?
Perhaps stop is not working correctly for these lsb services, so for
example stopping lsb:mysql_00
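You can check the init script by hand: an LSB-compliant script must return 0
from "stop" even when the service is already down, and a sensible code from
"status" (the script name here is just a guess):

  /etc/init.d/mysql_001 stop   ; echo $?
  /etc/init.d/mysql_001 status ; echo $?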
On Tue, Feb 12, 2019 at 08:00:38PM +0100, Kristoffer Grönlund wrote:
> One final note: hawk-apiserver uses a project called go-pacemaker
> located at https://github.com/krig/go-pacemaker. I intend to transfer
> this to ClusterLabs as well. go-pacemaker is still somewhat rough around
> the edges, an
On Wed, Jan 16, 2019 at 04:20:03PM +0100, Valentin Vidic wrote:
> I think drbd always calls crm-fence-peer.sh when it becomes disconnected
> primary. In this case storage1 has closed the DRBD connection and
> storage2 has become a disconnected primary.
>
> Maybe the problem is the
On Wed, Jan 16, 2019 at 09:03:21AM -0600, Bryan K. Walton wrote:
> The exit code 4 would seem to suggest that storage1 should be fenced.
> But the switch ports connected to storage1 are still enabled.
>
> Am I misreading the logs here? This is a clean reboot, maybe fencing
> isn't supposed to hap
On Wed, Jan 16, 2019 at 12:41:11PM +0100, Valentin Vidic wrote:
> This is what pacemaker says about the resource restarts:
>
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start dlm:1
> ( node2 )
> Jan 16 11:19:08 node1 pacemaker-scheduler
On Wed, Jan 16, 2019 at 12:16:04PM +, Andrew Price wrote:
> The only thing that stands out to me with this config is the lack of
> ordering constraint between dlm and lvmlockd. Not sure if that's the issue
> though.
They are both in the storage group, so the order should be dlm then lockd?
On Wed, Jan 16, 2019 at 12:28:59PM +0100, Valentin Vidic wrote:
> When node2 is set to standby, resources stop running there. However, when
> node2 is brought back online, it causes the resources on node1 to stop
> and then start again, which is a bit unexpected?
>
> Maybe the depende
Hi all,
I'm testing the following configuration with two nodes:
Clone: storage-clone
Meta Attrs: interleave=true target-role=Started
Group: storage
Resource: dlm (class=ocf provider=pacemaker type=controld)
Resource: lockd (class=ocf provider=heartbeat type=lvmlockd)
Clone: gfs2-clon
On Fri, Jan 11, 2019 at 12:42:02PM +0100, wf...@niif.hu wrote:
> I opened https://github.com/ClusterLabs/sbd/pull/62 with our current
> patches, but I'm just a middle man here. Valentin, do you agree to
> upstream these two remaining patches of yours?
Sure thing, merge anything you can...
--
Va
On Thu, Jan 03, 2019 at 04:56:26PM -0600, Ken Gaillot wrote:
> Right -- not only that, but corosync 1 (CentOS 6) and corosync 2
> (CentOS 7) are not compatible for running in the same cluster.
I suppose it is the same situation for upgrading from corosync 2
to corosync 3?
--
Valentin
On Tue, Nov 13, 2018 at 11:01:46AM -0600, Ken Gaillot wrote:
> Clone instances have a default stickiness of 1 (instead of the usual 0)
> so that they aren't needlessly shuffled around nodes every transition.
> You can temporarily set an explicit stickiness of 0 to let them
> rebalance, then unset i
On Tue, Nov 13, 2018 at 05:04:19PM +0100, Valentin Vidic wrote:
> Also it seems to require multicast, so better check for that too :)
And while the CLUSTERIP resource seems to work for me in a test
cluster, the following clone definition:
clone cip-clone cip \
meta clone-max=2 cl
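For context, a complete CLUSTERIP setup usually looks something like this
(the address and names are placeholders):

  crm configure primitive cip IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 \
             clusterip_hash=sourceip-sourceport \
      op monitor interval=30s
  crm configure clone cip-clone cip \
      meta clone-max=2 clone-node-max=2 globally-unique=true interleave=true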
On Tue, Nov 13, 2018 at 04:06:34PM +0100, Valentin Vidic wrote:
> Could be some kind of ARP inspection going on in the networking equipment,
> so check switch logs if you have access to that.
Also it seems to require multicast, so better check for that too :)
--
Va
On Tue, Nov 13, 2018 at 09:06:56AM -0500, Daniel Ragle wrote:
> Thanks, finally getting back to this. Putting a tshark on both nodes and
> then restarting the VIP-clone resource shows the pings coming through for 12
> seconds, always on node2, then stop. I.E., before/after those 12 seconds
> nothin
On Fri, Oct 19, 2018 at 11:09:34AM +0200, Kristoffer Grönlund wrote:
> I wonder if perhaps there was a configuration change as well, since the
> return code seems to be configuration related. Maybe something changed
> in the build scripts that moved something around? Wild guess, but...
Seems to be
On Wed, Oct 17, 2018 at 12:03:18PM +0200, Oyvind Albrigtsen wrote:
> - apache: retry PID check.
I noticed that the ocft test started failing for apache in this
version. Not sure if the test is broken or the agent. Can you
check if the test still works for you? Restoring the previous
version of th
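If you want to try reproducing it, the run is roughly (assuming the ocft tool
shipped with resource-agents):

  ocft make apache
  ocft test apache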
On Wed, Oct 10, 2018 at 02:36:21PM +0200, Stefan K wrote:
> I think my config is correct, but it still fails with "This Target
> already exists in configFS" but "targetcli ls" shows nothing.
It seems to find something in /sys/kernel/config/target. Maybe it
was set up outside of pacemaker somehow?
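You can look at the configfs tree directly to see what is left over there,
e.g.:

  ls -lR /sys/kernel/config/target/
  targetcli ls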
On Thu, Oct 11, 2018 at 01:25:52PM -0400, Daniel Ragle wrote:
> For the 12 second window it *does* work in, it appears as though it works
> only on one of the two servers (and always the same one). My twelve seconds
> of pings runs continuously then stops; while attempts to hit the Web server
> wor
On Tue, Oct 09, 2018 at 12:07:38PM +0200, Oyvind Albrigtsen wrote:
> I've created a PR for the library detection and try/except imports:
> https://github.com/ClusterLabs/fence-agents/pull/242
Thanks, I will give it a try right away...
--
Valentin
On Tue, Oct 09, 2018 at 10:55:08AM +0200, Oyvind Albrigtsen wrote:
> It seems like the if-line should be updated to check for those 2
> libraries (from the imports in the agent).
Yes, that might work too.
Also, would it be possible to make the imports in the openstack agent
conditional so the metadata
On Tue, Oct 02, 2018 at 03:13:51PM +0200, Oyvind Albrigtsen wrote:
> ClusterLabs is happy to announce fence-agents v4.3.0.
>
> The source code is available at:
> https://github.com/ClusterLabs/fence-agents/releases/tag/v4.3.0
>
> The most significant enhancements in this release are:
> - new fenc
On Fri, Oct 05, 2018 at 11:34:10AM -0500, Ken Gaillot wrote:
> The next big challenge is that high availability is becoming a subset
> of the "orchestration" space in terms of how we fit into IT
> departments. Systemd and Kubernetes are the clear leaders in service
> orchestration today and likely
On Thu, Sep 06, 2018 at 04:47:32PM -0400, Digimer wrote:
> It depends on the hardware you have available. In your case, RPi has no
> IPMI or similar feature, so you'll need something external, like a
> switched PDU. I like the APC AP7900 (or your country's variant), which
> you can often get used f
On Tue, Sep 11, 2018 at 09:31:13AM -0400, Patrick Whitney wrote:
> But, when I invoke the "human" stonith power device (i.e. I turn the node
> off), the other node collapses...
>
> In the logs I supplied, I basically do this:
>
> 1. stonith fence (With fence scsi)
After fence_scsi finishes the n
On Tue, Sep 11, 2018 at 04:14:08PM +0300, Vladislav Bogdanov wrote:
> And that is not an easy task sometimes, because the main part of dlm runs in
> the kernel.
> In some circumstances the only option is to forcibly reset the node.
Exactly, killing the power on the node will stop the DLM code running in
t
On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> So when the cluster suggests that DLM is shutdown on coro-test-1:
> Clone Set: dlm-clone [dlm]
> Started: [ coro-test-2 ]
> Stopped: [ coro-test-1 ]
>
> ... DLM isn't actually stopped on 1?
If you can connect to the node
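On the node itself something like this should show whether any lockspaces are
still active (assuming dlm_tool from the dlm package is available):

  dlm_tool status
  dlm_tool ls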
On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> What I'm having trouble understanding is why dlm flattens the remaining
> "running" node when the already fenced node is shutdown... I'm having
> trouble understanding how power fencing would cause dlm to behave any
> differently t
On Thu, May 24, 2018 at 12:16:16AM -0600, Casey & Gina wrote:
> Tried that, it doesn't seem to do anything but prefix the lines with the pid:
>
> [pid 24923] sched_yield() = 0
> [pid 24923] sched_yield() = 0
> [pid 24923] sched_yield() = 0
We managed to t
On Wed, Jul 11, 2018 at 04:31:31PM -0600, Casey & Gina wrote:
> Forgive me for interjecting, but how did you upgrade on Ubuntu? I'm
> frustrated with limitations in 1.1.14 (particularly in PCS so not sure
> if it's relevant), and Ubuntu is ignoring my bug reports, so it would
> be great to upgrade
On Wed, Jul 11, 2018 at 08:01:46PM +0200, Salvatore D'angelo wrote:
> Yes, but doing what you suggested, the system finds that SysV is
> installed and tries to leverage the update-rc.d scripts, and the failure
> occurs:
>
> root@pg1:~# systemctl enable corosync
> corosync.service is not a native service
On Tue, Apr 03, 2018 at 04:48:00PM +0200, Stefan Friedel wrote:
> we've a running drbd - iscsi cluster (two nodes Debian stretch, pacemaker /
> corosync, res group w/ ip + iscsitarget/lio-t + iscsiluns + lvm etc. on top of
> drbd etc.). Everything is running fine - but we didn't manage to get CHAP
On Thu, Mar 22, 2018 at 03:36:55PM -0400, Alberto Mijares wrote:
> Straight to the question: how can I manually run a resource agent
> script (kamailio) simulating the pacemaker's environment without
> actually having pacemaker running?
You should be able to run it with something like:
# OCF_ROOT
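A fuller sketch (the parameter name is only an example, check the agent's
meta-data for the real ones):

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_conffile=/etc/kamailio/kamailio.cfg  # example parameter
  /usr/lib/ocf/resource.d/heartbeat/kamailio start
  /usr/lib/ocf/resource.d/heartbeat/kamailio monitor; echo $?
  # or let ocf-tester drive all the actions:
  ocf-tester -n kamailio-test -o conffile=/etc/kamailio/kamailio.cfg \
      /usr/lib/ocf/resource.d/heartbeat/kamailio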
On Mon, Mar 12, 2018 at 04:31:46PM +0100, Klaus Wenninger wrote:
> Nope. Whenever the cluster is completely down...
> Otherwise nodes would come up - if not seeing each other -
> happily with both starting all services because they don't
> know what already had been running on the other node.
> Tec
On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote:
> But isn't dlm directly interfering with corosync so
> that it would get the quorum state from there?
> As you have 2-node set probably on a 2-node-cluster
> this would - after both nodes down - wait for all
> nodes up first.
Isn't
On Wed, Oct 11, 2017 at 02:36:24PM +0200, Valentin Vidic wrote:
> AFAICT, it found a better interface with that subnet and tried
> to use it instead of the one specified in the parameters :)
>
> But maybe IPaddr2 should just skip interface auto-detection
> if an explicit interfa
On Wed, Oct 11, 2017 at 01:29:40PM +0200, Stefan Krueger wrote:
> ohh damn.. thanks a lot for this hint.. I deleted all the IPs on enp4s0f0, and
> then it works..
> but could you please explain why it works now? why did it have a problem with
> these IPs?
AFAICT, it found a better interface with that s
On Wed, Oct 11, 2017 at 10:51:04AM +0200, Stefan Krueger wrote:
> primitive HA_IP-Serv1 IPaddr2 \
> params ip=172.16.101.70 cidr_netmask=16 \
> op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
> meta target-role=Started
There might be something wrong with the n
On Tue, Oct 10, 2017 at 11:26:24AM +0200, Václav Mach wrote:
> # The primary network interface
> allow-hotplug eth0
> iface eth0 inet dhcp
> # This is an autoconfigured IPv6 interface
> iface eth0 inet6 auto
allow-hotplug or dhcp could be causing problems. You can try
disabling corosync and pacem
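For example (just a guess at the cause, adjust to your setup):

  # make the interface a static "auto" one so it is fully configured
  # before corosync starts
  sed -i 's/^allow-hotplug eth0/auto eth0/' /etc/network/interfaces
  # or keep the cluster from starting at boot and start it by hand
  # once the network is up
  systemctl disable corosync pacemaker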
On Tue, Oct 10, 2017 at 10:35:17AM +0200, Václav Mach wrote:
> Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27
On Thu, Oct 05, 2017 at 08:55:59PM +0200, Jehan-Guillaume de Rorthais wrote:
> It doesn't seem impossible, however I'm not sure of the complexity around
> this.
>
> You would have to either hack PAF and detect failover/migration or create a
> new
> RA that would always be part of the transition
On Tue, Sep 12, 2017 at 04:48:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> PostgreSQL Automatic Failover (PAF) v2.2.0 has been released on September
> 12th 2017 under the PostgreSQL licence.
>
> See: https://github.com/dalibo/PAF/releases/tag/v2.2.0
>
> PAF is a PostgreSQL resource agent for
On Mon, Sep 11, 2017 at 04:18:08PM +0200, Klaus Wenninger wrote:
> Just for my understanding: You are using watchdog-handling in corosync?
The corosync package in Debian gets built with --enable-watchdog, so by
default it takes /dev/watchdog during runtime. I don't think SUSE
or RedHat packages get buil
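You can check which process is holding the device with e.g.:

  fuser -v /dev/watchdog
  lsof /dev/watchdog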
On Sun, Sep 10, 2017 at 08:27:47AM +0200, Ferenc Wágner wrote:
> Confirmed: setting watchdog_device: off cluster wide got rid of the
> above warnings.
Interesting, what brand or version of IPMI has this problem?
--
Valentin
On Fri, Sep 08, 2017 at 09:39:26PM +0100, Andrew Cooper wrote:
> Yes. The internal mechanism of the host watchdog is to use one
> performance counter to count retired instructions and generate an NMI
> roughly once every half second (give or take C and P states).
>
> Separately, there is a one se
On Fri, Sep 08, 2017 at 12:57:12PM +, Mark Syms wrote:
> As we discussed regarding the handling of watchdog in XenServer, both
> guest and host, I've had a discussion with our subject matter expert
> (Andrew, cc'd) on this topic. The guest watchdogs are handled by a
> hardware timer in the hype
On Mon, Aug 28, 2017 at 04:10:50AM +0200, Oscar Segarra wrote:
> In Ceph, by design there is no single point of failure in terms of server
> roles, nevertheless, from the client point of view, it might exist.
>
> In my environment:
> Mon1: 192.168.100.101:6789
> Mon2: 192.168.100.102:6789
> Mon3:
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote:
> Standby is not necessary, it's just a cautious step that allows the
> admin to verify that all resources moved off correctly. The restart that
> yum does should be sufficient for pacemaker to move everything.
>
> A restart shouldn't le
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> Lsof/fuser show the PID of the process holding FS open as "kernel".
That could be the NFS server running in the kernel.
--
Valentin
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
> So yesterday I ran yum update that pulled in the new pacemaker and tried to
> restart it. The node went into its usual "can't unmount drbd because kernel
> is using it" and got stonith'ed in the middle of yum transaction. The end
> res
On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote:
> The challenge is that some properties are docker-specific and other
> container engines will have their own specific properties.
>
> We decided to go with a tag for each supported engine -- so if we add
> support for rkt, we'll add a
On Fri, Mar 31, 2017 at 05:43:02PM -0500, Ken Gaillot wrote:
> Here's an example of the CIB XML syntax (higher-level tools will likely
> provide a more convenient interface):
>
>
>
>
Would it be possible to make this a bit more generic like:
so we have support for other container engin
On Thu, Jan 26, 2017 at 09:31:23PM +0100, Valentin Vidic wrote:
> Guess you could create a Dummy resource and make INFINITY colocation
> constraints for the IPs so they follow Dummy as it moves between the
> nodes :)
In fact using resource sets this becomes one rule:
colocation ip
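Something along these lines (resource names are placeholders, "anchor" being
the Dummy resource, and the set semantics are worth double-checking with
crm_simulate):

  crm configure colocation ips-with-anchor inf: \
      ( vip4-1 vip4-2 vip6-1 vip6-2 ) anchor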
On Thu, Jan 26, 2017 at 12:10:24PM +0100, Arturo Borrero Gonzalez wrote:
> I have a rather simple 2 nodes active/active router using pacemaker+corosync.
>
> Why active-active? Well, one node holds the virtual IPv4 resources and
> the other node holds the virtual IPv6 resources.
> On failover, both
On Mon, Jul 25, 2016 at 07:58:51PM +0200, Thierry Boibary wrote:
> is "Pacemaker" available on Debian 8.1?
Only via jessie-backports, as you can see here:
https://packages.debian.org/search?keywords=pacemaker
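Roughly (the mirror URL is only an example):

  echo 'deb http://httpredir.debian.org/debian jessie-backports main' \
      > /etc/apt/sources.list.d/backports.list
  apt-get update
  apt-get install -t jessie-backports pacemaker crmsh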
--
Valentin
FAIL: test_run_all_workers (pcs.test.test_utils.RunParallelTest)
--
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pcs/test/test_utils.py", line
1800, in test_run_all_workers
self.assertEqual(log, [
On Thu, Jun 30, 2016 at 01:27:25PM +0200, Tomas Jelinek wrote:
> It seems eventmachine can be safely dropped as all tests passed without it.
Great, thanks for confirming.
--
Valentin
On Wed, Jun 29, 2016 at 10:31:42AM +0200, Tomas Jelinek wrote:
> This should be replaceable by any agent which does not provide unfencing,
> i.e. it does not have on_target="1" automatic="1" attributes in <action name="on" />. You may need to experiment with a few agents to find one which
> works.
Just ch
On Tue, Jun 28, 2016 at 02:35:53PM +0200, Tomas Jelinek wrote:
> You are right. The right pacemaker (and corosync, resource agents...)
> version is needed for tests to pass. It's not an easy task to figure out
> what the right version is, though. For pcs 0.9.152 it's
> pacemaker-1.1.15-2.el7.
>
>
I'm trying to run pcs tests on Debian unstable, but there
are some strange failures like diffs failing due to an
additional space at the end of the line or just with
"Error: cannot load cluster status, xml does not conform to the schema"
Any idea what could be the issue here? I assume the tests
w
Hi,
Is it safe to drop eventmachine as a dependency? I see it's
only mentioned in the makefiles and not used by any of the
ruby code:
pcsd/Makefile: vendor/cache/eventmachine-1.2.0.1.gem \
pcsd/Gemfile:gem 'eventmachine'
pcsd/Gemfile.lock:eventmachine (1.2.0.1)
pcsd/Gemfile.lock: eventmachi
On Fri, Jan 22, 2016 at 07:57:52PM +0300, Vladislav Bogdanov wrote:
> Tried reverting this one and a51b2bb ("If an error occurs unlink the
> lock file and exit with status 1") one-by-one and both together, the
> same result.
>
> So problem seems to be somewhere deeper.
I've got the same fencing