Hi Christoph, 

I believe that, at least for synchronous replication with protocol C, the oos 
count should always be 0 in a healthy, fully synchronized configuration, and 
that any occurrence of a value >0 (except while manual administrative tasks 
are running) indicates a problem that needs to be investigated. 
Therefore I regard an automated disconnect-connect, done solely to clear the 
oos counter without determining the cause, as both a very bad idea and bad 
practice.
We have run hundreds of synchronously replicated DRBD8 volumes for years now 
and verify them weekly, and we have never seen oos that was not caused by a 
runtime, configuration or hardware issue.

Our verification runs use a script similar to yours, but it actively 
parallelises the task to minimise the total duration while maintaining a 
constant load that won't harm performance. It does so by sorting all volumes 
by size and then running a given number of verify tasks at once, beginning 
with the largest volumes, and starting the next verify as soon as one 
finishes. Especially on machines that have a few very big volumes and lots of 
small ones, this allows the verification of all volumes to complete in the 
time the big volumes alone would take, giving minimal duration at constant I/O 
load without peaks. The script prints a report to stdout and any occurrence of 
oos to stderr, making it easy to filter for problems, even before monitoring 
notices. 
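
A minimal sketch of the scheduling described above, not the actual script. 
It assumes a caller-supplied verify(name) function (hypothetical) that runs 
one verification to completion and returns the resulting oos count; the 
function name, the sizes mapping and max_parallel are all illustrative.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def verify_all(sizes, verify, max_parallel=2):
    """Verify every volume, largest first, at most max_parallel at a time.

    sizes:  dict mapping volume name -> size (any consistent unit)
    verify: callable(name) -> oos count after the run (hypothetical hook)
    """
    # Largest volumes first, so the long runs start as early as possible.
    order = sorted(sizes, key=sizes.get, reverse=True)
    results = {}
    # The pool hands the next queued volume to a worker as soon as one is
    # free, which keeps the number of concurrent verifies constant.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [(name, pool.submit(verify, name)) for name in order]
        for name, future in futures:
            results[name] = future.result()
    # Report to stdout; anything out of sync additionally goes to stderr.
    for name in order:
        oos = results[name]
        print(f"{name}: oos={oos}")
        if oos > 0:
            print(f"{name}: {oos} KiB out of sync", file=sys.stderr)
    return results
```

Redirecting stdout to a log file and leaving stderr attached to cron's mail 
output would then surface only the volumes that need attention.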

Best regards, 
// Veit 


-------- Original Message --------
From: Christoph Lechleitner <christoph.lechleit...@iteg.at>
Sent: 28 December 2017 01:05:30 CET
To: drbd-user <drbd-user@lists.linbit.com>
CC: Wolfgang Glas <wolfgang.g...@iteg.at>
Subject: [DRBD-user] Semantics of oos value, verification abortion

Hello everybody!


I have a question regarding the exact semantics of the oos value in
/proc/drbd.


The User's Guide
  https://docs.linbit.com/doc/users-guide-84/ch-admin/
says:
  "oos (out of sync). Amount of storage currently out of sync; in
Kibibytes. Since 8.2.6."

After several unsettling events over the years we have now started to
do regular verify runs.

We will announce our script as open source right here at some point in
the future, but we want to clarify some details first.

Our script basically calls
  drbdadm verify
on one resource at a time, because
  drbdadm verify all
would kill the system for sure.

After the verification run has completed, the script
- analyses the oos: value,
- possibly disconnects & connects the resource
- starts verification of the next resource
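
The "analyses the oos: value" step could look roughly like the following 
sketch. It assumes the two-line-per-device layout that /proc/drbd uses in 
DRBD 8.x; the SAMPLE text is illustrative, not taken from a real system, and 
parse_oos is a hypothetical helper name. It returns the connection state 
alongside oos so that a caller can tell a value read in the Connected state 
apart from one read while a verify or resync is still in progress.

```python
import re

# Device header line, e.g. " 0: cs:Connected ro:Primary/Secondary ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# Counter line containing "... oos:<KiB>"
OOS_RE = re.compile(r"\boos:(\d+)")

# Illustrative sample only (assumed layout, invented values).
SAMPLE = """\
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:128
"""


def parse_oos(proc_drbd_text):
    """Return {minor: (connection_state, oos_kib)} from /proc/drbd-style text."""
    result = {}
    minor = state = None
    for line in proc_drbd_text.splitlines():
        dev = DEVICE_RE.match(line)
        if dev:
            # Remember which device the following counter line belongs to.
            minor, state = int(dev.group(1)), dev.group(2)
            continue
        oos = OOS_RE.search(line)
        if oos and minor is not None:
            result[minor] = (state, int(oos.group(1)))
    return result


if __name__ == "__main__":
    for minor, (state, oos_kib) in sorted(parse_oos(SAMPLE).items()):
        print(f"minor {minor}: cs={state} oos={oos_kib} KiB")
```

On a live node the text would come from reading /proc/drbd instead of SAMPLE.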

The script does not run as a daemon; it is simply called regularly via
cron, on the node with the more important resources.


My main question is:

Should the oos value always be 0?

Does a non-0 value of oos mean that there have been sync errors?

Or does oos include blocks that are currently being synced or waiting
to be synced, too?

In the latter case, what would be a valid condition to disconnect &
connect a resource after a verification run?


Also: Are there events that can cause a verification run to be aborted?

One verification run on a huge resource (1.3 TB, HW RAID 5, dedicated
GBit line) was finished way too fast, so I think something must have
aborted it, like, say,
- a buffer runs full
-> automatic disconnect/reconnect
-> verification aborted

If something along these lines is possible, is there a way to avoid or
detect that?
Maybe a kernel message we could grep for?
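
Independently of any kernel message, a script can flag a run that "finished 
way too fast" with a simple plausibility check. The sketch below assumes that 
a verify has to read the whole device, so it cannot finish faster than the 
device size divided by the best sustained read rate of the storage. The rate 
used in the example (500 MB/s) and the decimal interpretation of 1.3 TB are 
assumptions, and both function names are hypothetical.

```python
def min_verify_seconds(size_bytes, max_read_bytes_per_s):
    """Lower bound on the duration of a full read of the device."""
    return size_bytes / max_read_bytes_per_s


def finished_too_fast(size_bytes, elapsed_s, max_read_bytes_per_s):
    """True if the run ended sooner than a full read could possibly take."""
    return elapsed_s < min_verify_seconds(size_bytes, max_read_bytes_per_s)


if __name__ == "__main__":
    size = 1_300_000_000_000   # 1.3 TB, decimal (assumption)
    rate = 500_000_000         # 500 MB/s sustained read (assumption)
    print(min_verify_seconds(size, rate))           # lower bound in seconds
    print(finished_too_fast(size, 10 * 60, rate))   # a 10-minute run
```

A run flagged this way is only a hint that the verify may have been 
interrupted; the cause still has to be looked up in the logs.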


Thanks,

Regards,

Christoph


-- 

Christoph Lechleitner

Geschäftsführung

------------------------------------------------------------------------
ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
Mail: christoph.lechleit...@iteg.at | Web: http://www.iteg.at/
------------------------------------------------------------------------

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user