On 29/09/2021 01:10, Chris Pacejo wrote:
Hi, I have a three-node active/passive DRBD cluster, operating with default
configuration. I had to replace disks on one of the nodes (call it node A) and
resync the cluster.
Somehow, after doing this, A was not in sync with the primary (node C); I only
discovered this because I couldn't even mount the filesystem on it after
(temporarily) making A primary. I don't fully understand how I got into this
situation but that's a tangent for now.
Following instructions in the documentation, I enabled a verification algorithm, and
instructed A to verify (`drbdadm verify <my volume>`). It correctly found many
discrepancies (gigabytes worth!) and emitted the ranges to dmesg.
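[For readers following along, the verify step above boils down to a couple of commands. This is a sketch: `r0` is a placeholder resource name, the `run` dry-run wrapper is mine and not part of drbd-utils, and the exact "Out of sync" log wording is from memory, so check your own dmesg output.]

```shell
#!/bin/sh
# Sketch of the verify step described above. "r0" is a placeholder
# resource name; drbdadm is not assumed to be installed here, so
# DRY_RUN defaults to 1 and the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run drbdadm verify r0
# Once the verify pass finishes, the mismatched ranges appear in the
# kernel log (wording from memory -- check your dmesg):
run sh -c 'dmesg | grep "Out of sync"'
```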
I then attempted to resynchronize A with C (the primary) by running `drbdadm disconnect
<my volume>` and then `drbdadm connect <my volume>`, again following the
documentation. This did not appear to do anything, despite verify having just found nearly
the entire disk to be out of sync. Indeed, running verify a second time produced exactly
the same results.
Instead I forced a full resync by bringing A down, invalidating it, and
bringing it back up again. Now verification showed A and C to be in sync.
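[The forced full resync described here would look roughly like the following; again `r0` is a hypothetical resource name and the `run` dry-run wrapper is mine, so treat this as a sketch of the steps, not drbd-utils output.]

```shell
#!/bin/sh
# Sketch of the full-resync recovery described above. "r0" is a
# placeholder resource name; DRY_RUN defaults to 1 so the commands
# are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# On node A (the out-of-sync secondary):
run drbdadm down r0          # take the resource offline
run drbdadm invalidate r0    # discard A's copy: mark all local blocks out of date
run drbdadm up r0            # on reconnect, a full resync from the peer begins
```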
What I usually do in this situation (I believe it arises because no writes
have hit the primary while disconnected), to avoid the drastic step of
completely invalidating a secondary node, is: disconnect the
secondary, force a tiny change on the primary (e.g. touch and delete an
empty file on the filesystem, or run a filesystem check that updates the fs
metadata), then reconnect. Of course this forces a resync and, in my
experience and judging by the number of KB resynced, the
resync includes the blocks that verify found out of sync.
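[A sketch of that workaround, for reference. The resource name `r0` and mount point `/mnt/data` are placeholders, and the `run` dry-run wrapper is mine, not part of drbd-utils:]

```shell
#!/bin/sh
# Sketch of the lighter-weight workaround described above. "r0" and
# the mount point "/mnt/data" are placeholders; DRY_RUN defaults to 1
# so the commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# On the secondary:
run drbdadm disconnect r0
# On the primary: force at least one write so the peers' bitmaps diverge
run touch /mnt/data/.resync-nudge
run rm /mnt/data/.resync-nudge
# Back on the secondary: reconnecting now triggers a resync
run drbdadm connect r0
```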
However, A was still showing a small number (thousands) of discrepancies with
node B (the other secondary node). So I repeated the above steps on B --
verify/disconnect/connect/verify -- and again nothing changed. B still shows
discrepancies with both A and C.
Running the same steps on node C (the primary) again found discrepancies with
B, and again failed to resynchronize.
What am I missing? Is there an additional step needed to convince DRBD to
resynchronize blocks found to mismatch during verify?
Further questions --
Why does `drbdadm status` not show whether out-of-sync blocks were found by
`drbdadm verify`? Instead it shows UpToDate like nothing is wrong.
Why is resynchronization only triggered on reconnect? Is there a downside to
simply starting resynchronization when out-of-sync blocks are discovered?
I believe this has simply been left for the user, to take whatever action
is desired via the out-of-sync helper. I suppose some people might not
want any automatic action taken, and would rather have a helper script send
them a notification so they can intervene manually.
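[A sketch of wiring that up, assuming the `out-of-sync` handler documented in drbd.conf(5). The resource name, algorithm, and script path below are placeholders, not a tested configuration:]

```
resource r0 {          # "r0" is a placeholder resource name
  net {
    verify-alg sha1;   # a verify algorithm must be set before `drbdadm verify` will run
  }
  handlers {
    # Called when an online verify finds out-of-sync blocks; the path
    # is hypothetical -- the script might mail the admin or log the
    # event so someone can decide whether to trigger a resync.
    out-of-sync "/usr/local/sbin/notify-out-of-sync.sh";
  }
}
```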
Eddie
Version info:
DRBDADM_BUILDTAG=GIT-hash:\ 5acfd06032d4c511c75c92e58662eeeb18bd47db\ build\
by\ ec2-u...@test-cluster-c.cpacejo.test\,\ 2021-07-06\ 20:48:54
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090102
DRBD_KERNEL_VERSION=9.1.2
DRBDADM_VERSION_CODE=0x091200
DRBDADM_VERSION=9.18.0
dmesg logs below.
Thanks,
Chris
<snip>
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user