Thank you!

________________________________
From: Philipp Reisner <philipp.reis...@linbit.com>
Sent: Thursday, April 4, 2024 1:06 PM
To: Tim Westbrook <tim_westbr...@selinc.com>
Cc: drbd-user@lists.linbit.com <drbd-user@lists.linbit.com>
Subject: Re: Usynced blocks if replication is interrupted during initial sync
[Caution - External]

Hello Tim,

We were able to write a reproducer test case and fix this regression with this commit:
https://github.com/LINBIT/drbd/commit/be9a404134acc3d167e8a7e60adce4f1910a4893

This commit will go into the drbd-9.1.20 and drbd-9.2.9 releases.

best regards,
 Philipp

On Fri, Mar 22, 2024 at 1:49 AM Tim Westbrook <tim_westbr...@selinc.com> wrote:
>
> Thank you
>
> So if "Copying bitmap of peer node_id=0" on reconnect after interruption
> indicates the issue, then the issue still exists for me.
>
> I am able to dump the metadata, but I am not sure it is very useful at this
> point...
>
> I have not tried invalidating after a mount/unmount, nor after adding a node,
> but we were trying to avoid unmounting once configured.
>
> Would you recommend against going back to a release version prior to this
> change?
>
> Is there any other information I can provide that would help? Could I dump
> the metadata at some point to show the expected/unexpected state?
>
> The latest flow is below.
>
> Thank you so much for your assistance,
> Tim
>
> 1. /dev/vg/persist mounted directly without DRBD
> 2. Enable DRBD by creating a single-node configuration file
> 3. Reboot
> 4. Create metadata on a separate disk (--max-peers=5)
> 5. drbdadm up persist
> 6. drbdadm invalidate persist
> 7. drbdadm primary --force persist
> 8. drbdadm down persist
> 9. drbdadm up persist
> 10. drbdadm invalidate persist*
> 11. drbdadm primary --force persist
> 12. mount /dev/drbd0 on /persist
> 13. Start using that mount point
> 14. Some time later...
> 15. Modify the configuration to add a new target backup node
> 16. Copy the config to the remote node and reboot; it will restart as secondary
> 17. drbdadm adjust persist (on the primary)
> 18. The secondary comes up and the initial sync starts
> 19. Stop at 50% by disabling the network interface
> 20. Re-enable the network interface
> 21. The sync completes right away - the node_id=0 message appears here
> 22. drbdadm verify persist - fails on many blocks
>
>
> From: Joel Colledge <joel.colle...@linbit.com>
> Sent: Wednesday, March 20, 2024 12:02 AM
> To: Tim Westbrook <tim_westbr...@selinc.com>
> Cc: drbd-user@lists.linbit.com <drbd-user@lists.linbit.com>
> Subject: Re: Usynced blocks if replication is interrupted during initial sync
>
> [Caution - External]
>
> > We are still seeing the issue as described, but perhaps I am not putting
> > the invalidate in the right spot.
> >
> > Note - I've added it at step 6 below, but I'm wondering if it should come
> > after the additional node is configured and adjusted (in which case I
> > would need to unmount, as apparently you can't invalidate a disk that is
> > in use).
> >
> > So do I need to invalidate after every node is added?
>
> With my reproducer, the workaround at step 6 works.
>
> > Also note, the node-id in the kernel logs is 0, but the peers are
> > configured with 1 and 2. Is this an issue, or are these separate IDs?
>
> I presume you are referring to the line:
> "Copying bitmap of peer node_id=0"
> The reason that node ID 0 appears here is that DRBD stores a bitmap of the
> blocks that have changed since it was first brought up. This is the "day0"
> bitmap. It is stored in all unused bitmap slots; all unused node IDs point
> to one of these bitmaps. In this case, node ID 0 is unused, so this line
> means that DRBD is using the day0 bitmap here. This is unexpected, as
> mentioned in my previous reply.
>
> Joel
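[Editor's illustration] Joel's day0-bitmap explanation can be pictured with a toy model. This is a conceptual sketch only, not DRBD's on-disk format or kernel data structures: each peer bitmap slot either belongs to a configured peer node ID, and every unused node ID resolves to the shared day0 bitmap, which is why a log line naming the unused node ID 0 indicates the day0 bitmap was used.

```python
# Toy model of the bitmap-slot lookup described in the thread.
# Conceptual sketch only -- not DRBD's actual data structures.

configured_peers = {1, 2}          # peer node IDs from the report's config
day0_bitmap = "day0"               # blocks changed since the device was created
per_peer_bitmaps = {1: "bitmap-for-node-1", 2: "bitmap-for-node-2"}

def bitmap_for(node_id: int) -> str:
    """Configured peers have their own bitmap; every unused node ID
    falls back to the shared day0 bitmap."""
    if node_id in configured_peers:
        return per_peer_bitmaps[node_id]
    return day0_bitmap

# "Copying bitmap of peer node_id=0": node ID 0 is unused here, so the
# day0 bitmap is selected -- unexpected during a normal resync.
print(bitmap_for(0))   # prints "day0"
print(bitmap_for(1))   # prints "bitmap-for-node-1"
```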
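[Editor's illustration] The single-node bring-up portion of Tim's flow (steps 4-12) can be sketched as a script. The resource name `persist`, the `--max-peers=5` metadata option, and the `/dev/drbd0` mount come from the thread; everything else is an assumption. Since `drbdadm` requires root and a configured resource, the sketch defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of steps 4-12 of the reported flow. DRY_RUN defaults to 1 so the
# script only prints commands; set DRY_RUN=0 to actually execute them.
set -eu
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run drbdadm create-md --max-peers=5 persist   # step 4: metadata, 5 peer slots
run drbdadm up persist                        # step 5
run drbdadm invalidate persist                # step 6: suggested workaround
run drbdadm primary --force persist           # step 7
run drbdadm down persist                      # step 8
run drbdadm up persist                        # step 9
run drbdadm invalidate persist                # step 10
run drbdadm primary --force persist           # step 11
run mount /dev/drbd0 /persist                 # step 12
```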