Looking at https://lists.linbit.com/pipermail/drbd-user/2017-September/023601.html for an explanation of how protocol C works, it seems that the scenario I described above is indeed possible if the data is cached above DRBD before we disconnect the secondary node.
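One way I could make the test deterministic (just a sketch; "r0" and /mnt/r0 are placeholder names for the resource and its mount point) would be to force the cached data down through DRBD, or bypass the page cache entirely, before cutting the network:

    # Write through the page cache, then force the dirty pages down
    # through DRBD while the replication link is still up:
    dd if=/dev/urandom of=/mnt/r0/testfile bs=4k count=1000
    sync

    # Or bypass the page cache entirely, so every write reaches DRBD
    # (and, with protocol C, the secondary) before dd reports completion:
    dd if=/dev/urandom of=/mnt/r0/testfile bs=4k count=1000 oflag=direct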
My disk-flushes and md-flushes are left at the defaults, so they are 'yes'.

Regards,
Janusz.

On Wed, 11 Aug 2021 at 01:05, Digimer <li...@alteeve.ca> wrote:

> On 2021-08-10 3:16 p.m., Janusz Jaskiewicz wrote:
>> Hi,
>>
>> Thanks for your answers.
>>
>> Answering your questions:
>> DRBD_KERNEL_VERSION=9.0.25
>>
>> Linux kernel:
>> 4.18.0-305.3.1.el8.x86_64
>>
>> File system type:
>> XFS.
>>
>> So the file system is not cluster-aware, but as far as I understand,
>> in an active/passive (single-primary) setup, which is what I have, it
>> should be OK. I just checked the documentation, which seems to
>> confirm that.
>>
>> I think the problem may come from the way I'm testing it. I came up
>> with the testing scenario described in my first post because I didn't
>> have an easy way to abruptly restart the server. When I do a hard
>> reset of the primary server, it works as expected (at least I can
>> find a logical explanation).
>>
>> I think what happened in my previous scenario was this: the service
>> is writing to the disk, and some portion of the written data is in a
>> disk cache. As the picture
>> https://linbit.com/wp-content/uploads/drbd/drbd-guide-9_0-en/images/drbd-in-kernel.png
>> shows, the cache is above the DRBD module. Then I kill the service
>> and the network, but some data is still in the cache. At some point
>> the cache is flushed and the data gets written to the disk. DRBD
>> probably reports an error at this point, as it can't send that data
>> to the secondary node (DRBD thinks the other node has left the
>> cluster).
>>
>> When I check the files at this point, I see more data on the primary,
>> because it also contains the data from the cache, which was not
>> replicated because the network was down by the time the data hit
>> DRBD.
>>
>> When I do a hard restart of the server, the data in the cache is
>> lost, so we don't observe the result above.
>>
>> Does it make sense?
>>
>> Regards,
>> Janusz.
>
> OK, it sounded from your first post like you had the FS mounted on
> both nodes at the same time; that would be a problem. If it's only
> mounted in one place at a time, then it's OK.
>
> As for caching: DRBD on the Secondary will say "write complete" to the
> Primary, in protocol C, when it has been told that the disk write is
> complete. So if the cache is _above_ DRBD's kernel module, then that's
> probably not the problem, because the Secondary won't tell the Primary
> it's done until it receives the data. If there is a caching issue
> _below_ DRBD on the Secondary, then it's _possible_ that's the
> problem, but I doubt it. The reason is that whatever is managing the
> cache below DRBD on the Secondary should know that a given block
> hasn't been flushed yet and, on a read request, read from cache, not
> disk. This is a guess on my part.
>
> What are your 'disk { disk-flushes [yes|no]; and md-flushes [yes|no]; }'
> set to?
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal
> talent have lived and died in cotton fields and sweatshops."
> - Stephen Jay Gould
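P.S. In case it helps anyone reproducing this, here is how I confirmed the defaults (the resource name "r0" below is a placeholder). If I understand the tooling right, 'drbdsetup show' only prints options that differ from their defaults, so an empty disk section means both flush options are still 'yes':

    # Show the effective options of a running resource; add
    # --show-defaults to print the defaulted values as well:
    drbdsetup show r0
    drbdsetup show --show-defaults r0

    # Pinning the defaults explicitly in the resource file would
    # look like this:
    resource r0 {
      disk {
        disk-flushes yes;
        md-flushes   yes;
      }
    }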