My understanding is that the existing mon_clock_drift_allowed value of 50 ms 
(default) exists so that Paxos among the mon quorum can function.  So OSDs (and 
mgrs, clients, etc.) are out of scope of that existing check.
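
For reference, you can confirm the value on a running cluster - a quick sketch, 
assuming a Mimic-or-later release with the centralized config store, and the 
mon name in the second command is just an example:

    # effective value for the mons (default is 0.05 s, i.e. 50 ms)
    ceph config get mon mon_clock_drift_allowed

    # or ask one monitor directly over its admin socket
    ceph daemon mon.$(hostname -s) config get mon_clock_drift_allowed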

Things like this are why I like to ensure that the OS does an `ntpdate -b` or 
equivalent step of the clock at boot time, *before* starting ntpd / chrony - 
and before the other daemons.
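
One way to get that effect with chrony and systemd - just a sketch, assuming a 
systemd 239+ host and the stock ceph-osd@.service unit; the file paths and the 
drop-in name are illustrative:

    # /etc/chrony/chrony.conf (or /etc/chrony.conf on RHEL-family)
    # let chrony *step* a large offset during the first few updates
    # after boot instead of slowly slewing it away
    makestep 1.0 3

    # /etc/systemd/system/ceph-osd@.service.d/wait-for-clock.conf
    # hold OSD startup until the clock is reported as synchronized
    [Unit]
    After=time-sync.target
    Wants=time-sync.target

    # time-sync.target is only reached once systemd-time-wait-sync
    # (or an equivalent one-shot such as `chronyd -q` / `ntpdate -b`)
    # says the clock is set, so also:
    #   systemctl enable systemd-time-wait-sync.service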

Now, as to why Ceph doesn’t have analogous code to complain about other 
daemons / clients - I’ve wondered that for some time myself.  Perhaps the idea 
is that one’s monitoring infrastructure should detect that, but that’s a 
guess.
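
For what it’s worth, if you already run node_exporter + Prometheus, a rule 
along these lines would catch a badly skewed OSD-only host independently of 
Ceph (a sketch - the 500 ms threshold and the rule/alert names are arbitrary):

    # node_timex_offset_seconds comes from node_exporter's timex collector
    groups:
      - name: clock
        rules:
          - alert: NodeClockSkew
            expr: abs(node_timex_offset_seconds) > 0.5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Clock offset on {{ $labels.instance }} is over 500 ms"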

> Yesterday, one of our OSD-only hosts came up with its clock about 8 hours 
> wrong(!) having been out of the cluster for a week or so. Initially, ceph 
> seemed entirely happy, and then after an hour or so it all went South (OSDs 
> start logging about bad authenticators, I/O pauses, general sadness).
> 
> I know clock sync is important to Ceph, so "one system is 8 hours out, Ceph 
> becomes sad" is not a surprise. It is perhaps a surprise that the OSDs were 
> allowed in at all...
> 
> What _is_ a surprise, though, is that at no point in all this did Ceph raise 
> a peep about clock skew. Normally it's pretty sensitive to this - our test 
> cluster has had clock skew complaints when a mon is only slightly out, and 
> here we had a node 8 hours wrong.
> 
> Is there some oddity like Ceph not warning on clock skew for OSD-only hosts? 
> or an upper bound on how high a discrepancy it will WARN about?
> 
> Regards,
> 
> Matthew
> 
> example output from mid-outage:
> 
> root@sto-3-1:~#  ceph -s
>  cluster:
>    id:     049fc780-8998-45a8-be12-d3b8b6f30e69
>    health: HEALTH_ERR
>            40755436/2702185683 objects misplaced (1.508%)
>            Reduced data availability: 20 pgs inactive, 20 pgs peering
>            Degraded data redundancy: 367431/2702185683 objects degraded (0.014%), 4549 pgs degraded
>            481 slow requests are blocked > 32 sec. Implicated osds 188,284,795,1278,1981,2061,2648,2697
>            644 stuck requests are blocked > 4096 sec. Implicated osds 22,31,33,35,101,116,120,130,132,140,150,159,201,211,228,263,327,541,561,566,585,589,636,643,649,654,743,785,790,806,865,1037,1040,1090,1100,1104,1115,1134,1135,1166,1193,1275,1277,1292,1494,1523,1598,1638,1746,2055,2069,2191,2210,2358,2399,2486,2487,2562,2589,2613,2627,2656,2713,2720,2837,2839,2863,2888,2908,2920,2928,2929,2947,2948,2963,2969,2972
> 