Re: [DRBD-user] drbd-utils v9.22.0

2022-09-22 Thread Eddie Chapman

On 20/09/2022 08:42, Roland Kammerer wrote:

Dear DRBD users,





The second bigger change is that the rpm part now generates a
drbd-selinux sub package containing an updated SELinux policy. Depending
on the host distribution, that package might even become a runtime
dependency for the drbd-utils sub package. Users/downstream building
packages should add "checkpolicy", and "selinux-policy-devel" to their
build systems. The Debian (alike) world is not affected by that change,
as SELinux isn't that widely used (by default) there.



I have several Gentoo SELinux systems (SELinux has good support on 
Gentoo) and I build drbd-utils for them using Gentoo's ebuild system, 
which uses configure/make. Thanks for adding an SELinux policy, which 
sounds great! I'm sure I'll be able to figure out how to extract and 
build the policy myself, but it would be nice if it wasn't tied to the 
rpm build only :-)
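
In case it helps anyone else in the same boat, building and loading the 
policy by hand should be roughly this (a rough sketch, assuming the 
policy source lands in the tree as something like drbd.te; the file name 
and path may well differ):

  # compile the type-enforcement source into a policy module
  checkmodule -M -m -o drbd.mod drbd.te
  # wrap it into a loadable policy package
  semodule_package -o drbd.pp -m drbd.mod
  # install it into the running policy
  semodule -i drbd.pp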


Thanks,
Eddie
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] verify/disconnect/connect doesn't resync?

2021-10-06 Thread Eddie Chapman

On 29/09/2021 01:10, Chris Pacejo wrote:

Hi, I have a three-node active/passive DRBD cluster, operating with default 
configuration.  I had to replace disks on one of the nodes (call it node A) and 
resync the cluster.

Somehow, after doing this, A was not in sync with the primary (node C); I only 
discovered this because I couldn't even mount the filesystem on it after 
(temporarily) making A primary.  I don't fully understand how I got into this 
situation but that's a tangent for now.

Following instructions in the documentation, I enabled a verification algorithm, and 
instructed A to verify (`drbdadm verify `).  It correctly found many 
discrepancies (gigabytes worth!) and emitted the ranges to dmesg.

I then attempted to resynchronize A with C (the primary) by running `drbdadm disconnect 
` and then `drbdadm connect `, again, following 
documentation.  This did not appear to do anything, despite verify having just found nearly 
the entire disk to be out of sync.  Indeed, running verify a second time produced the exact 
same results.

Instead I forced a full resync by bringing A down, invalidating it, and 
bringing it back up again.  Now verification showed A and C to be in sync.


What I usually do in this situation (I believe it happens because no 
writes have hit the primary while disconnected), to avoid the drastic 
step of having to completely invalidate a secondary node, is: disconnect 
the secondary, force a tiny change on the primary (e.g. touch and delete 
an empty file on the filesystem, or run a filesystem check which updates 
the fs metadata), then reconnect. Of course this forces a resync and, in 
my experience and from what I can tell by the number of KBs resynced, the 
resync includes the verified blocks found out of sync.
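
For the record, the sequence is roughly this (resource name r0 and mount 
point /mnt/r0 are just examples):

  drbdadm disconnect r0      # disconnect the secondary
  # on the primary, with the fs still mounted, force a tiny metadata change:
  touch /mnt/r0/.resync-trigger && rm /mnt/r0/.resync-trigger
  drbdadm connect r0         # reconnect; in my experience the resync that
                             # follows has included the blocks that verify
                             # marked out of sync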




However A was still showing a small number (thousands) of discrepancies with 
node B (the other secondary node).  So I repeated the above steps on B -- 
verify/disconnect/connect/verify -- and again, nothing changed.  B still shows 
discrepancies between it and both A and C.

Running the same steps on node C (the primary) again found discrepancies with 
B, and again failed to resynchronize.

What am I missing?  Is there an additional step needed to convince DRBD to 
resynchronize blocks found to mismatch during verify?

Further questions --

Why does `drbdadm status` not show whether out-of-sync blocks were found by 
`drbdadm verify`?  Instead it shows UpToDate like nothing is wrong.

Why is resynchronization only triggered on reconnect?  Is there a downside to 
simply starting resynchronization when out-of-sync blocks are discovered?


I believe this has just been left for the user to take whatever action 
is desired using the out-of-sync helper. I suppose some people might not 
want any automatic action taken and just have a helper script send them 
a notification so they can manually intervene.
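
For anyone looking for it, that hook lives in the handlers section of the 
resource config, something like this (the script path here is made up; 
handlers are passed context via environment variables such as 
DRBD_RESOURCE):

  handlers {
      # called after online verify finds out-of-sync blocks
      out-of-sync "/usr/local/sbin/notify-oos.sh";
  }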


Eddie



Version info:
DRBDADM_BUILDTAG=GIT-hash:\ 5acfd06032d4c511c75c92e58662eeeb18bd47db\ build\ 
by\ ec2-u...@test-cluster-c.cpacejo.test\,\ 2021-07-06\ 20:48:54
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090102
DRBD_KERNEL_VERSION=9.1.2
DRBDADM_VERSION_CODE=0x091200
DRBDADM_VERSION=9.18.0

dmesg logs below.

Thanks,
Chris



___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] protocol C replication - unexpected behaviour

2021-08-10 Thread Eddie Chapman

On 10/08/2021 00:01, Digimer wrote:

On 2021-08-05 5:53 p.m., Janusz Jaskiewicz wrote:

Hello.

I'm experimenting a bit with DRBD in a cluster managed by Pacemaker.
It's a two node, active-passive cluster and the service that I'm 
trying to put in the cluster writes to the file system.
The service manages many files, it appends to some of them and 
increments the single integer value in others.


I'm observing surprising behaviour and I would like to ask you if what 
I see is expected or not (I think not).


I'm using protocol C, but still I see some delay in the files that are 
being replicated to the secondary server.
For the files that increment the integer I see a difference which 
corresponds roughly to 1 second of traffic.


I'm really surprised to see this, as protocol C should guarantee 
synchronous replication.
I'd rather expect some delay in processing (potentially slower disk 
writes due to the network replication).


The way I'm testing it:
The service runs on primary and writes to DRBD drive, secondary 
connected and "UpToDate".
I kill the service abruptly (kill -9) and then take down the network 
interface between primary and secondary (kill and ifdown commands in 
the script so executed quite promptly one after the other).
Then I mount the DRBD drive on both nodes and check the difference in 
the files with incrementing integer.


I would appreciate any help or pointers on how to fix this.
But first of all I would like to confirm that this behaviour is not 
expected.


Also if it is expected/allowed, how can I decrease the impact?



What filesystem are you using? Is it cluster / multi-node aware?


The filesystem may be relevant in that filesystems can behave in ways 
one might not expect, depending on how they are tuned, so it would be 
good to know what the fs is and what mount options are being used. 
However, the filesystem certainly does not need to be aware that the 
underlying block device is drbd or part of a cluster of any kind. A drbd 
device should look like a regular block device and there is no need to 
treat it as anything else.


I would also like to know the kernel and drbd versions; if these are old 
enough, then expecting things to "work" in a sane fashion might not be 
reasonable :-)
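
Something like the following, run on both nodes, would tell us what we 
need:

  uname -r            # kernel version
  cat /proc/drbd      # in-kernel drbd module version
  drbdadm --version   # userland utils version info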

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] The Problem of File System Corruption w/DRBD

2021-06-03 Thread Eddie Chapman

On 03/06/2021 13:50, Eric Robinson wrote:

-Original Message-
From: Digimer 
Sent: Wednesday, June 2, 2021 7:23 PM
To: Eric Robinson ; drbd-user@lists.linbit.com
Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD

On 2021-06-02 5:17 p.m., Eric Robinson wrote:

Since DRBD lives below the filesystem, if the filesystem gets
corrupted, then DRBD faithfully replicates the corruption to the other
node. Thus the filesystem is the SPOF in an otherwise shared-nothing

architecture.

What is the recommended way (if there is one) to avoid the filesystem
SPOF problem when clusters are based on DRBD?

-Eric


To start, HA, like RAID, is not a replacement for backups. That is the answer
to a situation like this... HA (and other availability systems like RAID) 
protect
against component failure. If a node fails, the peer recovers automatically
and your services stay online. That's what DRBD and other HA solutions strive
to provide; uptime.

If you want to protect against corruption (accidental or intentional, a-la
cryptolockers), you need a robust backup system to _complement_ your HA
solution.



Yes, thanks, I've said for many years that HA is not a replacement for disaster 
recovery. Still, it is better to avoid downtime than to recover from it, and 
one of the main ways to achieve that is through redundancy, preferably a 
shared-nothing approach. If I have a cool 5-node cluster and the whole thing 
goes down because the filesystem gets corrupted, I can restore from backup, but 
management is going to wonder why a 5-node cluster could not provide 
availability. So the question remains: how to eliminate the filesystem as the 
SPOF?



Some of the things being discussed here have nothing to do with drbd. 
drbd provides a raw block-level device. It neither knows nor cares what 
layers you place above it, whether they be filesystems or some other 
block layer such as LVM or bcache.


It does a very specific job: ensure that the blocks you write to a drbd 
device get replicated and stored in real time on one or more other 
distributed hosts. If you write a 512-byte block of random garbage to a 
drbd device it will (and should) write the exact same garbage to the 
other distributed hosts too, so that if you read that same 512-byte block 
back from any one of those individual hosts, you'll get the exact same 
garbage back.
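
As a purely illustrative sketch (only ever do this on a scratch/test 
resource, since it overwrites data; device names and the offset are made 
up, and it assumes internal metadata so that data offsets on the drbd 
device and the backing device line up):

  # on the Primary, write one 512-byte block of garbage through drbd:
  dd if=/dev/urandom of=/dev/drbd99 bs=512 count=1 seek=12345 conv=fsync
  # on each node, read the same block back from the *backing* device:
  dd if=/dev/vgX/scratch bs=512 count=1 skip=12345 2>/dev/null | md5sum

With protocol C the checksums should match on every node once the write 
has completed.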


The OP stated "if the filesystem gets corrupted, then DRBD faithfully 
replicates the corruption to the other node." Good! That's exactly what 
we want it to do. What we definitely do NOT want is for drbd to 
manipulate the block data given to it in any way whatsoever, we want it 
to faithfully replicate this.

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] building v9

2020-12-10 Thread Eddie Chapman

On 10/12/2020 12:07, Christoph Böhmwalder wrote:

Hi Pierre,

As much as we may want it, DRBD's coccinelle-based compat system is not 
a general purpose solution. We can't guarantee that DRBD will build for 
any given kernel – there is simply too much going on in the block layer 
and other parts of the kernel, so we cannot possibly cover all those 
different combinations (and still expect DRBD to work as intended).


So we have made a bit of a compromise: we build and test DRBD for a 
defined set of kernels. These are deemed "most interesting", according 
to the opinion of LINBIT and our customers. Namely, we currently build 
for these 125 kernels on the amd64 architecture at time of writing:


Distribution          | Kernel Version
--------------------- | --------------
amazonlinux2.0-amd64  | 4.14.128-112.105.amzn2
debian-buster-amd64   | 4.19.0-5; 4.19.0-6; 4.19.0-8
debian-jessie-amd64   | 3.16.0-4; 3.16.0-5; 3.16.0-6
debian-stretch-amd64  | 4.9.0-6; 4.9.0-7; 4.9.0-8; 4.9.0-9; 4.9.0-11
oracle6.0-amd64       | 4.1.12-124.26.3.el6uek; 4.1.12-124.21.1.el6uek
oracle7.0-amd64       | 4.14.35-1844.1.3.el7uek; 4.1.12-94.3.9.el7uek; 4.1.12-124.26.10.el7uek; 4.14.35-1902.4.8.el7uek; 4.14.35-1818.3.3.el7uek
oracle8.0-amd64       | 5.4.17-2011.0.7.el8uek
rhel6.10-amd64        | 2.6.32-754.el6
rhel6.6-amd64         | 2.6.32-504.el6
rhel6.7-amd64         | 2.6.32-573.1.1.el6
rhel6.8-amd64         | 2.6.32-642.1.1.el6
rhel6.9-amd64         | 2.6.32-696.el6; 2.6.32-696.23.1.el6; 2.6.32-696.30.1.el6
rhel7-xen-amd64       | 4.9.188-35.el7; 4.9.199-35.el7; 4.9.206-36.el7; 4.9.212-36.el7; 4.9.215-36.el7
rhel7.0-amd64         | 3.10.0-123.20.1.el7
rhel7.1-amd64         | 3.10.0-229.1.2.el7
rhel7.2-amd64         | 3.10.0-327.el7
rhel7.3-amd64         | 3.10.0-514.6.2.el7; 3.10.0-514.36.5.el7
rhel7.4-amd64         | 3.10.0-693.el7; 3.10.0-693.17.1.el7; 3.10.0-693.21.1.el7
rhel7.5-amd64         | 3.10.0-862.el7
rhel7.6-amd64         | 3.10.0-957.el7
rhel7.7-amd64         | 3.10.0-1049.el7; 3.10.0-1062.el7
rhel7.8-amd64         | 3.10.0-1127.el7
rhel7.9-amd64         | 3.10.0-1160.el7
rhel8.0-amd64         | 4.18.0-80.1.2.el8_0
rhel8.1-amd64         | 4.18.0-147.el8
rhel8.2-amd64         | 4.18.0-193.el8
rhel8.3-amd64         | 4.18.0-240.1.1.el8_3
sles11-sp4-amd64      | 3.0.101-108.13.1
sles12-sp2-amd64      | 4.4.74-92.38.1
sles12-sp3-amd64      | 4.4.92-6.30.1
sles12-sp4-amd64      | 4.12.14-95.3.1
sles12-sp5-amd64      | 4.12.14-120.1
sles15-sp0-amd64      | 4.12.14-25.25.1
sles15-sp1-amd64      | 4.12.14-197.29
sles15-sp2-amd64      | 5.3.18-22.2
ubuntu-bionic-amd64   | ✗ 5.3.0-1034-aws; ✗ 5.3.0-1035-aws; 5.4.0-1025-aws; 5.4.0-1028-aws; 5.4.0-1029-aws; 5.4.0-1030-aws; 4.15.0-1007-aws
ubuntu-bionic-amd64   | ✗ 5.3.0-1035-azure; ✗ 5.4.0-1023-azure; 5.4.0-1025-azure; 5.4.0-1026-azure; 5.4.0-1031-azure; 5.4.0-1032-azure; 4.15.0-1009-azure
ubuntu-bionic-amd64   | 4.15.0-112-lowlatency
ubuntu-bionic-amd64   | ✗ 4.15.0-118; ✗ 4.15.0-121; 4.15.0-122; 4.15.0-123; 4.15.0-124; 4.15.0-126; 4.15.0-20
ubuntu-focal-amd64    | ✗ 5.4.0-1022-aws; ✗ 5.4.0-1024-aws; 5.4.0-1025-aws; 5.4.0-1028-aws; 5.4.0-1029-aws; 5.4.0-1030-aws; 5.4.0-1009-aws
ubuntu-focal-amd64    | ✗ 5.4.0-1022-azure; ✗ 5.4.0-1023-azure; 5.4.0-1025-azure; 5.4.0-1026-azure; 5.4.0-1031-azure; 5.4.0-1032-azure; 5.4.0-1010-azure
ubuntu-focal-amd64    | ✗ 5.4.0-51; ✗ 5.4.0-52; 5.4.0-48; 5.4.0-53; 5.4.0-54; 5.4.0-56; 5.4.0-26
ubuntu-trusty-amd64   | 4.4.0-1022-aws
ubuntu-trusty-amd64   | 3.13.0-129; 3.13.0-133; 3.13.0-139; 3.13.0-142; 3.13.0-149
ubuntu-xenial-amd64   | 4.4.0-1092-aws; 4.4.0-1098-aws; 4.4.0--aws; 4.4.0-1114-aws; 4.4.0-1117-aws
ubuntu-xenial-amd64   | 4.13.0-1018-azure; 4.15.0-1036-azure; 4.15.0-1040-azure
ubuntu-xenial-amd64   | ✗ 4.4.0-190; 4.4.0-193; 4.4.0-194; 4.15.0-120; 4.15.0-123; 4.4.0-197
xenserver6.5-amd64    | 3.10.41-323
xenserver7.1-amd64    | 4.4.27-572.565306
xenserver7.2-amd64    | 4.4.52-2.1
xenserver8.0-amd64    | 4.19.19-5.0.8


Using one of these kernels will give you the smoothest experience when 
building DRBD. We actually pre-compute all compat patches for these 
kernels and put them in our release tarballs. This means that, if one of 
these kernels is detected, you will not need spatch at all and just need 
to apply a plain patch.
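
For reference, an out-of-tree build from a release tarball looks roughly 
like this (the version number is just an example):

  tar xzf drbd-9.x.y.tar.gz
  cd drbd-9.x.y
  make KDIR=/lib/modules/$(uname -r)/build
  make install
  depmod -a
  modprobe drbd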


For a lucky set of other kernels, we have made SPAAS (spatch as a 
service). This sends a "fingerprint" of your currently running kernel's 
capabilities to LINBIT, where we can then build a compat patch 
specifically for that kernel. This also works sometimes, but again, we 
cannot possibly guarantee that this works for any given kernel (not to 
mention test it so that it actually does the right thing).


So, in conclusion, you have 2 options:

a) Use one of the kernels we already support
b) Figure out how to have DRBD build for your kernel yourself (it's not 
fun, take my word for it)

c) Become a LINBIT customer and we will gladly do it for you :)


Regarding this 

Re: [DRBD-user] Failed starting drbd after updating Fedora 30

2019-08-13 Thread Eddie Chapman

On 13/08/2019 16:22, Jamie wrote:

On Mon, Aug 13, 2019 at 17:17:31, Jamie wrote:

Sorry, I tried posting this message this morning and something 
apparently didn't work out correctly... :(


Okay, consider me stupid...? Somehow, I couldn't see my post from this 
morning until just a moment ago. I apologize for the double posting. 
*heads desk*


I think it was just queued waiting for moderator approval, as I also 
didn't get your original message from the list until about 4.5 hrs after 
you sent it.


Anyway, I believe the issue Roland was referring to is being discussed 
on the Linux kernel mailing list at the moment, I spotted it the other 
day when reading the stable releases list.


https://www.spinics.net/lists/stable/msg319730.html

Christoph Böhmwalder from LINBIT has posted a proposed fix for the 
issue, which is basically a regression in mainline drbd that was 
accidentally introduced by a tree-wide code change, and which you and 
others appear to have been bitten by.


The thing is, Christoph and Roland are just now in the process of 
getting their proposed fix (which isn't a simple revert, as it was a 
tree-wide change) accepted by the block subsystem maintainer, and then it 
will end up in Linus Torvalds' tree once the maintainer submits his next 
batch of fixes to Linus. Once that happens, you have to wait for it to 
eventually be backported by someone and accepted into one of the 
frequent upstream kernel stable releases (the rule is that all stable 
patches must exist in Linus' tree first), and then hopefully Fedora will 
pick it up and apply it to their kernel. Once Fedora has released an 
updated kernel you would then finally see the fix when you install that.


Personally I would just upgrade to drbd 9 and then you don't have to 
wait for the above process :-) (as the issue only affects drbd 8)

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Failed starting drbd after updating Fedora 30

2019-08-13 Thread Eddie Chapman

On 13/08/2019 13:16, Roland Kammerer wrote:

On Tue, Aug 13, 2019 at 12:40:56PM +0100, Eddie Chapman wrote:

I interpreted the package "drbd.x86_64 - version 9.5.0-2.fc30" as being the
kernel module version 9.5, since what else is left after you've packaged the
utilities and udev scripts? There's only the kernel component.


Without reading the rest: no.

There is no such kmod version. And most likely - I never looked at the
Fedora packages - it is a meta package that pulls in the user space. We
(LINBIT) also provide a meta package named like this :-/ I don't like it,
it confuses people. And here we have one more proof.

I still assume the kmod comes from your kernel. Again, cat /proc/drbd.
If that shows 9.5, then allow me to join your next time travel...

Regards, rck


Ha ha ha, goodness me, yes I completely forgot the kernel component 
version numbers are 9.0.x, not 9.x. The *utilities* are 9.x. For some 
reason my mind has almost completely stopped seeing the 0 in the middle 
for the kernel module after all these years. :-) A couple of mails back I 
even referred to the kernel module as both 9.x and 9.0.x in the same 
sentence, I really am losing it! Apologies if I might have led anyone 
else into insanity with me as well :-)

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Failed starting drbd after updating Fedora 30

2019-08-13 Thread Eddie Chapman

On 13/08/2019 07:12, Roland Kammerer wrote:

On Mon, Aug 12, 2019 at 12:00:40PM +0200, Jamie wrote:

Hi all,

I've encountered quite a problem after updating my Fedora 30 and I hope
someone might've come across this problem as well because I couldn't really
find a lot of information online.

I'm using the following:

  * drbd.x86_64 - version 9.5.0-2.fc30
  * drbd-udev.x86_64 - version 9.5.0-2.fc30
  * drbd-utils.x86_64 - version 9.5.0-2.fc30

After updating Fedora 30 from kernel version *5.1.19-300*.fc30.x86_64 to
kernel version *5.2.5-200*.fc30.x86_64 the drbd service won't even start
anymore, giving me the following output (names are anonymized):


This is just utils and udev and stuff, but not the actual kmod, right?
The kmod comes from upstream, is that correct? (cat /proc/drbd).
If so, this would explain it. Unfortunately in-kernel DRBD broke between
5.1 and 5.2. There are already discussions on how to fix it.

Regards, rck


Hi Roland,

I interpreted the package "drbd.x86_64 - version 9.5.0-2.fc30" as being 
the kernel module version 9.5, since what else is left after you've 
packaged the utilities and udev scripts? There's only the kernel component.


But my curiosity got the better of me and, in fact, drbd.x86_64 isn't 
anything at all. It's just a metapackage containing just 2 files 
(COPYING and ChangeLog):


https://fedora.pkgs.org/30/fedora-x86_64/drbd-9.5.0-2.fc30.x86_64.rpm.html

The kernel module drbd.ko is in the 
kernel-core-5.0.9-301.fc30.x86_64.rpm package:


https://fedora.pkgs.org/30/fedora-x86_64/kernel-core-5.0.9-301.fc30.x86_64.rpm.html

So as the drbd kernel module is in the main kernel package then, yes, as 
you suspect it has to be drbd8, certainly, as drbd9 is not upstream yet. 
And yes, I think you're right that the OP's problem is likely the 5.1 to 
5.2 bug being discussed on the kernel.org stable list at the moment.


It surprises me that, for a distro considered fairly "bleeding edge", 
the default position as a Fedora user is that, ironically, you end up 
running ancient drbd8. "Yes please, give me the very latest GCC 9, glibc 
2.29, kernel 5.x, bash 5.0 .. and .. ok I'll take drbd8" :-) Joking 
aside, I know Fedora is just sticking with what comes with the upstream 
kernel, whatever that may be.


But it is a strange position when you think about it. They're taking the 
utilities from Linbit and packaging them, so why not just take the 
kernel component directly from Linbit too? Then you get the current 
version 9 as well, which isn't even remotely "bleeding edge" by Fedora's 
standards.


To the OP, my personal suggestion: you've chosen Fedora with all its 
shiny new software versions (a good choice I think), so blacklist the 
bundled and ancient drbd.ko and install drbd9, go on, do it :-) 
Seriously, drbd9 is awesome, you'll never look back. It's very easy to 
install from source. In fact Linbit probably has rpm packages, so go 
ahead and completely uninstall the (also ancient by the looks of things) 
drbd utilities rpms, and install the utilities from Linbit too. As a 
bonus you will also resolve your current problem, which is a drbd8-only 
bug. Don't worry, you don't need to "re-do" all your data, you can use 
the existing metadata. Just follow whatever is in Linbit's docs about 
upgrading from 8 to 9.
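
A rough sketch of what I mean (package and file names as they appear 
above; adjust as needed):

  # stop anything using drbd first, then stop the in-kernel drbd8 module loading:
  echo "blacklist drbd" > /etc/modprobe.d/drbd8-blacklist.conf
  # drop the distro packages:
  dnf remove drbd drbd-udev drbd-utils
  # then install the drbd9 kmod and current drbd-utils, either from Linbit's
  # packages or by building the release tarballs from drbd.org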


Regards,
Eddie
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Failed starting drbd after updating Fedora 30

2019-08-12 Thread Eddie Chapman

Hello Jamie,

On 12/08/2019 11:00, Jamie wrote:
Aug 12 11:28:34 serverjm drbdadm[8142]: Command 'drbdsetup-84 
new-resource VMNAME' terminated with exit code 20


I'm not familiar in the slightest with how Fedora packages drbd in your 
installation. However the above line strongly suggests to me that the 
init script that is being run by systemd is a drbd8 script, whereas the 
packages you have installed now appear to be drbd9 (9.5.0-2.fc30).


So evidently upgrading Fedora for you has involved also upgrading from 
drbd8 to drbd9.


(By the way, do those Fedora package version numbers correspond to drbd 
upstream release numbers? If so drbd kernel version 9.5 is quite old and 
lots has been fixed since then. Again I know nothing about how the 
distros package drbd these days as I install from source, but you should 
be running the latest 9.0.19-1 kernel version if you care about your 
data at all, and especially since Fedora ships a very recent kernel 
version.)


There are some changes needed to your drbd configuration files when 
upgrading from 8 to 9. So my suggestion would be (if you really are now 
running drbd9, as you appear to be doing) that you go through those 
configuration files and make those changes (check the drbd9 docs for 
what is needed), and then perhaps remove the existing v8-style init 
script from systemd and re-add the new v9 one that comes with your 
9.5.0-2.fc30 packages.
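
As a very rough idea of what a v9-style resource file looks like (host 
names, addresses and paths here are made up; check the docs for what 
your setup actually needs):

  resource r0 {
      device      /dev/drbd0;
      disk        /dev/vg0/r0;
      meta-disk   internal;

      on nodeA {
          address  10.0.0.1:7789;
          node-id  0;
      }
      on nodeB {
          address  10.0.0.2:7789;
          node-id  1;
      }

      connection-mesh {
          hosts nodeA nodeB;
      }
  }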


regards,
Eddie
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Managed to corrupt a drbd9 resource, but don't fully understand how

2019-07-30 Thread Eddie Chapman

On 29/07/2019 10:34, Eddie Chapman wrote:

Hello,

I've managed to corrupt one of the drbd9 resources on one of my 
production servers, now I'm trying to figure out exactly what happened 
so I can try and recover the data. I wonder if anyone might understand 
what went wrong here ( apart from the fact that PEBCAK :-) )?


That was a very long email from me yesterday, most people don't have 
time for that, so actually let me boil it down to just this question:


Should I be able to run lvextend on a backing device of a live drbd 
resource, but then safely be able to abort everything and "down" the 
resource, without telling drbd to resize it or do anything to it? Or is 
it expected that doing something like that will lead to data corruption, 
so just "Don't Do That"?


It's an important question for me to understand as I use drbd on a lot 
of servers and often do very advanced things with the layers below drbd. 
If I find myself in a similar situation again, it would be good to know 
for sure what is and is not expected behaviour.


Thanks,
Eddie
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] Managed to corrupt a drbd9 resource, but don't fully understand how

2019-07-29 Thread Eddie Chapman

Hello,

I've managed to corrupt one of the drbd9 resources on one of my 
production servers, now I'm trying to figure out exactly what happened 
so I can try and recover the data. I wonder if anyone might understand 
what went wrong here ( apart from the fact that PEBCAK :-) )?


This is a simple two node resource with an ext4 fs directly on top and 
an lvm volume as backing device, nothing special.


Gentoo, drbd kernel 9.0.19-1, drbd utilities 9.10.0, vanilla kernel.org 
4.19.60


Here's the sequence of events:

1. The resource is online in its normal state, connected, dstate 
UpToDate on both nodes. It is Primary/Secondary and I'm working on the 
Primary node. I unmount the fs on top of the resource.


2. I run: e2fsck -f /dev/drbd12
which completes without any problems.

3. I run: drbdadm secondary res0
on the Primary.
The roles are now Secondary/Secondary

4. I run: drbdadm down res0
on both nodes. Now the resource is fully down on both nodes.

5. I destroy the backing device on the node which is normally Secondary 
(wish I hadn't now but it was a calculated risk that was part of my 
original plan - heh). Now only one node exists.


6. I run: drbdadm up res0
on the remaining node. The resource comes up fine, it is now Secondary, 
UpToDate but disconnected. The fs on top is NOT mounted.


(Oops, later I realise I forgot to make the resource Primary, so it 
stays Secondary. But not sure if this is a factor at all in what follows.)


7. I extend the resource's backing device:

lvextend -l +11149 /dev/vg02/res0
  Size of logical volume vg02/res0 changed from 120.00 GiB (30720 
extents) to 163.55 GiB (41869 extents).

  Logical volume vg02/res0 successfully resized.

Oops, I made a mistake in my lvextend, I wanted to extend using extents 
from a specific PV but forgot to add the PV and extents range on the end 
of the lvextend command. No problem, I know the PV segments and their 
extent ranges from before the extend as I ran lvdisplay -m before I did 
anything. So I can simply delete and re-create the LV using the exact 
same segments and extent ranges to restore it to how it was (this is 
much easier than trying to remove specific PV segments from a LV which 
lvreduce/lvresize cannot do AFAICT, and an approach I've used 
successfully many times before). However I can't run lvremove on the 
backing device while it is still held open by drbd so ...


8. I run: drbdadm down res0
which completes successfully.

Later I look back at the kernel log and see that when I downed the 
resource in this step, the log is normal and as you would expect, apart 
from this 1 line:


drbd xtra3/0 drbd12: ASSERTION drbd_md_ss(device->ldev) == 
device->ldev->md.md_offset FAILED in drbd_md_write


A very obvious sign something is very wrong at this point, but of course 
I don't see that until later when doing my post mortem :-)


Apart from that 1 line all other lines logged when downing the device 
are normal. I've pasted the full log at the bottom of this mail.


9. With the resource down, I am now able to successfully delete and 
re-create the backing device. These are my commands:


lvremove /dev/vg02/res0
Do you really want to remove active logical volume vg02/res0? [y/n]: y
  Logical volume "res0" successfully removed

lvcreate -l 30720 -n res0 vg02 /dev/sdf1:36296-59597 
/dev/sdf1:7797-12998 /dev/sdf1:27207-29422
WARNING: ext4 signature detected on /dev/vg02/res0 at offset 1080. Wipe 
it? [y/n]: n

  Aborted wiping of ext4.
  1 existing signature left on the device.
  Logical volume "res0" created.

I run lvdisplay -m to confirm the list of segments and extent ranges is 
exactly as it was before I extended the LV in step 7, which it is.


Wonderful, I continue under my delusion that this is all going fine so 
far :-)


10. I run: drbdadm up res0
and then:
drbdadm primary res0

which both complete successfully. I've pasted the full kernel log from 
these 2 commands at the end in full.


11. I run: drbdadm primary res0
which completes successfully. All looks fine to me at this point.

12. I decide to run a forced e2fsck on the resource to double check all 
is OK. Here is now the first time I realise something has gone badly wrong:


e2fsck -f /dev/drbd12
e2fsck 1.45.3 (14-Jul-2019)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
Pass 1: Checking inodes, blocks, and sizes
Inode 262145 seems to contain garbage.  Clear?
/dev/drbd12: e2fsck canceled.

/dev/drbd12: * FILE SYSTEM WAS MODIFIED *

I hit Ctrl-C as soon as it asked me to clear that inode, but too late, 
the fs has been modified in some way.


I run dumpe2fs which dumps a full set of info on the fs which all looks 
very normal apart from the line:


Filesystem state: not clean

13. I run e2fsck again, this time with -n (which I should have done the 
first time round, silly me). It starts out:


e2fsck -f -n /dev/drbd12
e2fsck 1.45.3 (14-Jul-2019)
Pass 1: Checking inodes, blocks, and sizes
Inode 262145 seems to 

Re: [DRBD-user] 8 Zettabytes out-of-sync?

2018-11-02 Thread Eddie Chapman

On 02/11/18 08:45, Jarno Elonen wrote:

More clues:

Just witnessed a resync (after invalidate) to steadily go from 100% 
out-of-sync to 0% (after several automatic disconnects and reconnects). 
Immediately after reaching 0%, it went to negative -% 
! After that, drbdtop started showing 8.0ZiB out-of-sync.


Looks like a severe wrap-around bug.

-Jarno


On Thu, 1 Nov 2018 at 22:30, Jarno Elonen wrote:


Here's some more info.
Dmesg shows some suspicious looking log message, such as:

1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive
bitmap' by drbd_r_vm-117-s[5038]

2) Wrong magic value 0x0007 in protocol version 114

3) peer request with dagtag 399201392 not found
got_peer_ack [drbd] failed

4) Rejecting concurrent remote state change 2226202936 because of
state change 2923939731
Ignoring P_TWOPC_ABORT packet 2226202936.

5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but
bitmap already locked for 'write from resync_finished' by
drbd_w_vm-117-s[2812]
md_sync_timer expired! Worker calls drbd_md_sync().

6) incompatible discard-my-data settings
conn( Connecting -> Disconnecting )
error receiving P_PROTOCOL, e: -5 l: 7!

Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All
of them API v 16:

== mox-a ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-a, 2018-10-28 03:08:58
Transports (api:16): tcp (9.0.15-1)

== mox-b ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)

== mox-c ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
root@mox-c, 2018-10-28 05:45:05
Transports (api:16): tcp (9.0.16-1)

== mox-d ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
root@mox-d, 2018-10-29 00:22:23
Transports (api:16): tcp (9.0.16-1)

Running Proxmox (5.2-2), as you'd guess from the host names. DRBD
resources are being managed by LINSTOR.


On Thu, 1 Nov 2018 at 17:32, Jarno Elonen <elo...@iki.fi> wrote:

Okay, today one of these resources got a sudden, severe
filesystem corruption on the primary.

On the other hand, the secondaries (that showed 8ZiB
out-of-sync) were still mountable after I disconnected the
corrupted primary. No idea how current the secondaries' data was,
but drbdtop still showed them as connected and 8ZiB out-of-sync.

This is getting quite worrisome. Is anyone else experiencing
this with DRBD 9? Is it something really wrong in my setup, or
are there perhaps some known instabilities in DRBD 9.0.15-1?

-Jarno


On Wed, 31 Oct 2018 at 20:46, Jarno Elonen <elo...@iki.fi> wrote:

I've got several DRBD 9 resource that constantly show
*UpToDate* with 9223372036854774304 bytes (exactly 8ZiB) of
OutOfDate data.

Any idea what might cause this and how to fix it?

Example:

# drbdsetup status --verbose --statistics vm-106-disk-1
vm-106-disk-1 node-id:0 role:Primary suspended:no
     write-ordering:flush
   volume:0 minor:1003 disk:UpToDate quorum:yes
       size:16777688 read:215779 written:22369564
al-writes:89 bm-writes:0 upper-pending:0
       lower-pending:0 al-suspended:no blocked:no
   mox-a node-id:1 connection:Connected role:Secondary
congested:no ap-in-flight:0
       rs-in-flight:18446744073709549808
     volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
         received:215116 sent:22368903
out-of-sync:9223372036854774304 pending:0 unacked:0
   mox-c node-id:2 connection:Connected role:Secondary
congested:no ap-in-flight:0
       rs-in-flight:18446744073709549808
     volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
         received:1188 sent:19884428 out-of-sync:0 pending:0
unacked:0

Version info:
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)

-Jarno


Not exactly the same issue you are seeing, but I have had an issue this 
week with a newly created resource on a 9.0.16-1 primary against a 
9.0.13-1 secondary.


As soon as I started writing to the new primary the secondary started 
repeatedly disconnecting with the error:


drbd resource274 

[DRBD-user] Thanks for the recent commits

2018-08-04 Thread Eddie Chapman
Just wanted to say thanks guys for fixing the missing set-out-of-sync 
bit issue and others in the commits that have just appeared in drbd-9.0 
git repo. I was relieved to see them.


I believe I have been bitten by these bugs more than once in recent 
months, as whenever I have had a Primary resource go diskless for 
whatever reason, in recent months, it has almost always led to problems 
that are difficult to recover from. (I know, I need to find time to 
report issues properly with info & logs to be of more help)


Just yesterday I had a resource's Primary go diskless due to a 
mysterious read error (most likely a problem with a lower layer), and I 
cannot now re-attach the primary's disk again. drbd (9.0.13-1) starts 
negotiating the re-attach with the secondary, says "uuid_compare()=-100 
by rule 100", and just detaches.


By the way, I appreciate the tone of your commit messages :-) Please 
know that you are much appreciated, at least by me, even in the rare 
occasion when severe bugs are found :-) I know that kernel work is hard, 
and I admire & respect anyone who attempts it.


In view of the severity of these bugs, I am _REALLY_ looking forward to
9.0.15 rc2. May I ask if there is an ETA for it yet? Hope I am not 
pestering the "cook" too much, don't want him to throw knives out of the 
"kitchen" at me (or spit in the food) :-)


Thanks,
Eddie
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] How unstable is latest git?

2018-07-19 Thread Eddie Chapman
Thanks for clarifying that Roland, sounds like what is in git in between 
releases should be considered quite unstable then, so I'll stay well 
away until the next release. Looking forward to it :-)


regards,
Eddie

On 17/07/18 14:42, Roland Kammerer wrote:

On Tue, Jul 17, 2018 at 01:08:43PM +0100, Eddie Chapman wrote:

I keep an eye on commits to the drbd 9 repository
https://github.com/LINBIT/drbd-9.0/

and I see quite a few have gone in since 9.0.14-1 was released end of April
( thanks all at Linbit for the hard work :-) )

I'm planning on rebooting a couple of drbd servers using 9.0.14-1 this
weekend to update the kernels (I use the kernel.org stable releases) and am
thinking I might build the kernels with the latest drbd git rather than the
9.0.14-1 tarball.

My question is, how "unstable" would you guys say the latest git is?

Is it a case of "Don't even think about doing that, you're crazy, you'll be
lucky if it even builds, we don't even test it ourselves. Just stick to the
releases or it will eat your data." ?

Or is it more like "there is a small risk you'll be burnt since it has not
been widely tested, you're on your own of course, but more than likely we've
fixed more bugs than we've introduced so it will probably be OK" ?

Maybe it's an impossible question .. just trying to get a feel ...


It is one of those impossible questions. There are various reasons why we
push things in between releases. Maybe we show a proposed
fix, maybe we show something new for testing (which might break things).

For in-between there is no "guarantee" whatsoever, sorry, there is no
general rule. And usually we also don't push publicly in between releases.

But the next release should be "real soon now" ;-).

Regards, rck
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] How unstable is latest git?

2018-07-17 Thread Eddie Chapman

I keep an eye on commits to the drbd 9 repository
https://github.com/LINBIT/drbd-9.0/

and I see quite a few have gone in since 9.0.14-1 was released end of 
April ( thanks all at Linbit for the hard work :-) )


I'm planning on rebooting a couple of drbd servers using 9.0.14-1 this 
weekend to update the kernels (I use the kernel.org stable releases) and 
am thinking I might build the kernels with the latest drbd git rather 
than the 9.0.14-1 tarball.


My question is, how "unstable" would you guys say the latest git is?

Is it a case of "Don't even think about doing that, you're crazy, you'll 
be lucky if it even builds, we don't even test it ourselves. Just stick 
to the releases or it will eat your data." ?


Or is it more like "there is a small risk you'll be burnt since it has 
not been widely tested, you're on your own of course, but more than 
likely we've fixed more bugs than we've introduced so it will probably 
be OK" ?


Maybe it's an impossible question .. just trying to get a feel ...

thanks,
Eddie
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] No resync of oos data in bitmap

2018-05-04 Thread Eddie Chapman

On 04/05/18 09:10, Christiaan den Besten wrote:

Hi !

Question. Using DRBD 9.0.14 (latest from git) we can't get a resync after 
verify working. Having a simple 2-node resource created/configured 8.x style.

A "drbdadm verify" now successfully ends at 100% ( thank you so much Lars for 
fixing this! ) and it notices inconsistent data blocks ( self-inflicted by dd'ing some 
zeros on the secondary node ).

We then have :

[149702.915093] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: conn( 
Unconnected -> Connecting )
[149704.335863] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: Handshake 
to peer 0 successful: Agreed network protocol version 113
[149704.335866] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: Feature 
flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[149704.336280] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: Peer 
authenticated using 20 bytes HMAC
[149704.336299] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: Starting 
ack_recv thread (from drbd_r_r_drbd9. [4924])
[149704.391726] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: Preparing 
remote state change 196805945
[149704.392341] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: 
Committing remote state change 196805945 (primary_nodes=2)
[149704.392364] drbd r_drbd9.prolocation.net mhxen20.prolocation.net: conn( 
Connecting -> Connected ) peer( Unknown -> Secondary )
[149704.397800] drbd r_drbd9.prolocation.net/0 drbd11 mhxen20.prolocation.net: 
drbd_sync_handshake:
[149704.397805] drbd r_drbd9.prolocation.net/0 drbd11 mhxen20.prolocation.net: 
self 9E1AD7F59E5434FA::B3BDA5F13EDDFCEA:EE9BDB393791EAAC bits:0 
flags:120
[149704.397807] drbd r_drbd9.prolocation.net/0 drbd11 mhxen20.prolocation.net: 
peer 9E1AD7F59E5434FA::9E1AD7F59E5434FA:B3BDA5F13EDDFCEA bits:0 
flags:120
[149704.397809] drbd r_drbd9.prolocation.net/0 drbd11 mhxen20.prolocation.net: 
uuid_compare()=0 by rule 38
[149704.397830] drbd r_drbd9.prolocation.net/0 drbd11 mhxen20.prolocation.net: 
repl( Off -> Established )
[149704.405793] drbd r_drbd9.prolocation.net/1 drbd12 mhxen20.prolocation.net: 
drbd_sync_handshake:
[149704.405796] drbd r_drbd9.prolocation.net/1 drbd12 mhxen20.prolocation.net: 
self 686DD0F922994E9C::AEB10B63BD82F43A:6805740BE5A46E08 
bits:1048 flags:120
[149704.405799] drbd r_drbd9.prolocation.net/1 drbd12 mhxen20.prolocation.net: 
peer 686DD0F922994E9C::686DD0F922994E9C:AEB10B63BD82F43A 
bits:1048 flags:120
[149704.405801] drbd r_drbd9.prolocation.net/1 drbd12 mhxen20.prolocation.net: 
uuid_compare()=0 by rule 38
[149704.405803] drbd r_drbd9.prolocation.net/1 drbd12: No resync, but 1048 bits 
in bitmap!
[149704.405821] drbd r_drbd9.prolocation.net/1 drbd12 mhxen20.prolocation.net: 
repl( Off -> Established )

and the same on the other node

[146265.229215] drbd r_drbd9.prolocation.net/1 drbd12 mhxen10.prolocation.net: 
drbd_sync_handshake:
[146265.229218] drbd r_drbd9.prolocation.net/1 drbd12 mhxen10.prolocation.net: 
self 686DD0F922994E9C::686DD0F922994E9C:AEB10B63BD82F43A 
bits:1048 flags:120
[146265.229221] drbd r_drbd9.prolocation.net/1 drbd12 mhxen10.prolocation.net: 
peer 686DD0F922994E9C::AEB10B63BD82F43A:6805740BE5A46E08 
bits:1048 flags:120
[146265.229223] drbd r_drbd9.prolocation.net/1 drbd12 mhxen10.prolocation.net: 
uuid_compare()=0 by rule 38
[146265.229225] drbd r_drbd9.prolocation.net/1 drbd12: No resync, but 1048 bits 
in bitmap!
[146265.229244] drbd r_drbd9.prolocation.net/1 drbd12 mhxen10.prolocation.net: pdsk( 
DUnknown -> UpToDate ) repl( Off -> Established )

with

[root@mhxen10 ~]# grep ^ 
/sys/kernel/debug/drbd/resources/*/connections/*/*/proc_drbd
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/0/proc_drbd:11:
 cs:Established ro:Primary/Secondary ds:UpToDate/UpToDate C r-
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/0/proc_drbd:
ns:41941724 nr:0 dw:0 dr:167767960 al:0 bm:0 lo:0 pe:[0;0] ua:0 ap:[0;0] 
ep:1 wo:1 oos:0
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/0/proc_drbd:
   resync: used:0/61 hits:0 misses:0 starving:0 locked:0 changed:0
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/0/proc_drbd:
   act_log: used:0/1237 hits:0 misses:0 starving:0 locked:0 changed:0
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/0/proc_drbd:
   blocked on activity log: 0
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/1/proc_drbd:12:
 cs:Established ro:Primary/Secondary ds:UpToDate/UpToDate C r-
/sys/kernel/debug/drbd/resources/r_drbd9.prolocation.net/connections/mhxen20.prolocation.net/1/proc_drbd:
ns:41943040 nr:0 dw:0 dr:167773196 al:0 bm:0 lo:0 pe:[0;0] ua:0 ap:[0;0] 

[DRBD-user] drbd9 in mainline eventually?

2017-11-14 Thread Eddie Chapman
Have been looking through the list archives and release notes but cannot 
find any info on this, apologies if it has been stated before somewhere.

I was wondering if there are plans to eventually submit the drbd9 
codebase for integration into the mainline kernel? AFAICT from looking 
through drbd commits on kernel.org git, this hasn't happened yet.


thanks,
Eddie
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] drbd 9 resource failure due to apparent local io failure, but odd

2017-10-25 Thread Eddie Chapman

Hello,

I was wondering if someone could eyeball the logs further below from a 
resource that completely failed over the course of yesterday and today, 
and tell me if it looks like a "normal" failure of the underlying 
storage, or if there is anything strange?


I ask because there are 2 things that are odd:

1. On the primary node drbd reports that the underlying storage fails 
for the resource (1 out of 27, the rest all fine and healthy) on node 1, 
yet there are NO reports of failure from the underlying storage, which 
happens to be a block device used by other (still healthy) resources 
(the drbd backing devices are all logical volumes). The resource goes 
Diskless on the primary but service continues because of the secondary 
which is still fine.


2. 11 hours later, the same happens on the secondary node (different 
machine, different physical storage): drbd reports a read failure from 
local storage there (also LVs over a block device, the other resources 
also fine), yet no reports of failure from the underlying storage. This 
is of course the nail in the coffin for the resource, as both sides are 
now Diskless. Again, all other resources that share the same block 
device are still fine and 100% healthy, no signs of any other issues on 
either node.


Both nodes are drbd-9.0.9-1 from drbd.org on vanilla kernel.org kernel 
4.9.58.


The failed resource has existed without any problems for many weeks, but 
was originally created with drbd 9-0.8-1 on vanilla kernel 4.4.77. Both 
nodes were upgraded to drbd-9.0.9-1/4.9.58 a few days ago. I don't know 
if this is significant in any way.


Lastly, the failed resource is still there, both sides in Diskless 
state; is there anything I can poke, maybe in /sys/kernel/debug, that 
might give further info about what happened?
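
For example, is it worth just grepping through the debugfs entries, 
something like this (assuming debugfs is available)?

  mount -t debugfs none /sys/kernel/debug    # if not already mounted
  grep -r ^ /sys/kernel/debug/drbd/resources/RES7H10E/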


Thanks,
Eddie

Here is the log from the primary node when the first failure happened:

drbd RES7H10E/0 drbd42: local READ IO error sector 6192640+16 on dm-43
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in __req_mod. Detaching...
drbd RES7H10E/0 drbd42: sending new current UUID: 2A150CB88CD794F6
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1486856, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1486888, 20480), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1486928, 57344), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1478264, 4096), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1486704, 77824), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1486864, 12288), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1487048, 286720), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1487608, 262144), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
41556976, 61440), but my Disk seems to have failed :(
drbd RES7H10E/0 drbd42: Should have called drbd_al_complete_io(, 
1479376, 4096), but my Disk seems to have failed :(



And the log from the secondary node failure exactly 11 hours later:

drbd RES7H10E/0 drbd42: read: error=10 s=29090424s
drbd RES7H10E/0 drbd42: disk( UpToDate -> Failed )
drbd RES7H10E/0 drbd42: Local IO failed in drbd_endio_read_sec_final. 
Detaching...

drbd RES7H10E/0 drbd42 node1.mydomain: Sending NegDReply. sector=29090424s.
drbd RES7H10E/0 drbd42: disk( Failed -> Diskless )
drbd RES7H10E node1.mydomain: Wrong magic value 0x0090d574 in protocol 
version 112
drbd RES7H10E node1.mydomain: conn( Connected -> ProtocolError ) peer( 
Primary -> Unknown )
drbd RES7H10E/0 drbd42 node1.mydomain: pdsk( Diskless -> DUnknown ) 
repl( Established -> Off )

drbd RES7H10E node1.mydomain: ack_receiver terminated
drbd RES7H10E node1.mydomain: Terminating ack_recv thread
drbd RES7H10E node1.mydomain: Connection closed
drbd RES7H10E node1.mydomain: conn( ProtocolError -> Unconnected )
drbd RES7H10E node1.mydomain: Restarting receiver thread
drbd RES7H10E node1.mydomain: conn( Unconnected -> Connecting )
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] Question about disk-timeout

2017-05-22 Thread Eddie Chapman

Hello,

Am I right in my understanding that disk-timeout is safe to use as long 
as the device that fails *never* comes back?  Because it is the delayed 
"recovery", as the drbd.conf man page puts it, which is what is 
dangerous and could lead to kernel panic, corruption of original request 
pages, etc. Right? In other words just the failing and detaching of the 
backing device itself, due to disk-timeout, could not lead to bad things 
happening?


Can anyone confirm I'm correct?

(I know you might say "well why do you not want the failed device to 
recover?! You just should not use disk-timeout, it's a bad idea." Yes, 
normally I would not use it. But I have a specific, unusual situation 
where drbd and disk-timeout specifically might provide a very good 
solution to a problem, and in this situation if the backing device 
*does* fail due to disk-timeout, it would disappear and never come back 
ever again.)
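
For reference, the option I mean lives in the disk section of the 
resource config, something like this (the value is just a placeholder; 
see drbd.conf(5) for the unit, and 0, the default, disables it):

  disk {
      disk-timeout 100;
  }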


thanks,
Eddie
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] DRBD 9 diskless Primary, subsequent resync of new block device never completes

2017-05-19 Thread Eddie Chapman

Hello,

This happens to me often lately on a two-node cluster that otherwise 
works well.


If I have a Primary/Secondary resource, with the Primary then becoming 
diskless through me having run drbdadm detach on it, and I then 
create-md and attach a *new* block device to the Primary, the subsequent 
resync reaches 99% but never completes. Nothing is logged in dmesg at 
99%; resync disk activity stops and it never completes. Over time it 
then drops slowly to 98%, then 97%, if I leave it. In the end I have no 
choice but to detach the new block device. If I re-attach it, the same 
thing happens: it starts a completely new resync of the whole bitmap. 
The resource continues working fine regardless throughout.
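
For clarity, the sequence I'm describing is roughly this (the resource 
name is just an example):

  drbdadm detach r0        # Primary goes diskless
  # swap in the new backing device, then:
  drbdadm create-md r0     # fresh metadata on the new device
  drbdadm attach r0        # re-attach; a full resync from the Secondary starts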


I have a resource right this minute with this problem:

node1 ~ # drbdadm status YK39N2GA
YK39N2GA role:Primary
  disk:Inconsistent
  node2.mydomain role:Secondary
replication:SyncTarget peer-disk:UpToDate done:99.97

Is there anything perhaps I can query on this resource to give some more 
info on what might be wrong? I'll leave it like this for as long as I 
can.  Or anything specific I can monitor, during the resync, if I try 
attaching again?


Both nodes are:

up-to-date Gentoo
Vanilla kernel.org 4.4
kernel module 9.0.7, source tar.gz downloaded from drbd.org
drbd utilities 8.9.11 (same)

With the kernel, currently the Primary is at 4.4.68, Secondary at 
4.4.59, if that may be relevant.


I'm using drbdadm and friends rather than drbdmanage. I've used drbd for 
many years, I like and am familiar with drbdadm, and am reluctant to 
change :-)


Below is what was logged when the new block device was attached if it is 
any help. As I say nothing further is logged after the initial messages 
on attaching.


thanks,
Eddie

[15244.562956] drbd YK39N2GA/0 drbd52: disk( Diskless -> Attaching )
[15244.562970] drbd YK39N2GA/0 drbd52: Maximum number of peer devices = 1
[15244.563682] drbd YK39N2GA/0 drbd52: my node_id: 0
[15244.563686] drbd YK39N2GA/0 drbd52: Adjusting my ra_pages to backing 
device's (768 -> 32)

[15244.563688] drbd YK39N2GA/0 drbd52: my node_id: 0
[15244.563690] drbd YK39N2GA/0 drbd52: drbd_bm_resize called with 
capacity == 58720256
[15244.564163] drbd YK39N2GA/0 drbd52: resync bitmap: bits=7340032 
words=114688 pages=224

[15244.564165] drbd YK39N2GA/0 drbd52: size = 28 GB (29360128 KB)
[15244.591274] drbd YK39N2GA/0 drbd52: Writing the whole bitmap, size 
changed
[15244.603901] drbd YK39N2GA/0 drbd52: recounting of set bits took 
additional 0ms
[15244.603923] drbd YK39N2GA: Preparing cluster-wide state change 
3918652602 (0->-1 7680/2048)
[15244.604139] drbd YK39N2GA: State change 3918652602: primary_nodes=1, 
weak_nodes=FFFC
[15244.604141] drbd YK39N2GA: Committing cluster-wide state change 
3918652602 (0ms)

[15244.604166] drbd YK39N2GA/0 drbd52: disk( Attaching -> Negotiating )
[15244.604170] drbd YK39N2GA/0 drbd52: attached to current UUID: 
0004

[15244.604371] drbd YK39N2GA/0 drbd52 node2.mydomain: drbd_sync_handshake:
[15244.604374] drbd YK39N2GA/0 drbd52 node2.mydomain: self 
0005::: 
bits:7340032 flags:24
[15244.604376] drbd YK39N2GA/0 drbd52 node2.mydomain: peer 
A7E214FA09B78A66:543013BD23E0568A:FBFEDABFE5489160:835D869FF28BD86A 
bits:5236287 flags:100
[15244.604377] drbd YK39N2GA/0 drbd52 node2.mydomain: uuid_compare()=-3 
by rule 20
[15244.604379] drbd YK39N2GA/0 drbd52 node2.mydomain: Writing the whole 
bitmap, full sync required after drbd_sync_handshake.

[15244.620147] drbd YK39N2GA/0 drbd52: disk( Negotiating -> Inconsistent )
[15244.620149] drbd YK39N2GA/0 drbd52 node2.mydomain: repl( Established 
-> WFBitMapT )
[15244.634881] drbd YK39N2GA/0 drbd52 node2.mydomain: receive bitmap 
stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[15244.636111] drbd YK39N2GA/0 drbd52 node2.mydomain: send bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
[15244.636118] drbd YK39N2GA/0 drbd52 node2.mydomain: helper command: 
/sbin/drbdadm before-resync-target
[15244.639180] drbd YK39N2GA/0 drbd52 node2.mydomain: helper command: 
/sbin/drbdadm before-resync-target exit code 0 (0x0)
[15244.639195] drbd YK39N2GA/0 drbd52 node2.mydomain: repl( WFBitMapT -> 
SyncTarget )
[15244.639547] drbd YK39N2GA/0 drbd52 node2.mydomain: Began resync as 
SyncTarget (will sync 29360128 KB [7340032 bits set]).

___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user