Re: [ceph-users] clock skew

2019-04-28 Thread mj

An update.

We noticed contradictory output from chrony. "chronyc sources" showed 
that chrony was synced. However, we also noticed this output:



root@ceph2:/etc/chrony# chronyc activity
200 OK
0 sources online
4 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address


So "chronyc activity" shows OFFLINE sources.

After we changed sources to nl.pool.ntp.org, "chronyc activity" started 
showing the sources as ONLINE, and now, after a day running, our skew as 
reported by "ceph time-sync-status" is 0.00 on all hosts.


It seems that relying on "chronyc sources" is not always enough to make 
sure that everything is indeed synced.
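A minimal Python sketch of the check described above: it parses the text that `chronyc activity` prints (the ceph2 output quoted earlier in this message) and flags a host whose sources are all offline. The function names are hypothetical, and the parser assumes the "N sources ..." line format shown here; other chrony versions may word these lines differently.

```python
import re

# Matches counter lines such as "4 sources offline"; the "200 OK" header does not match.
COUNTER_RE = re.compile(r"^\s*(\d+)\s+(sources?\b.*)$")

def parse_chronyc_activity(text):
    """Map each counter label printed by `chronyc activity` to its integer value."""
    counts = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return counts

def all_sources_offline(counts):
    """True when chrony has no online sources but at least one offline source."""
    return counts.get("sources online", 0) == 0 and counts.get("sources offline", 0) > 0

# Output copied from the ceph2 example above.
SAMPLE = """200 OK
0 sources online
4 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address
"""

counts = parse_chronyc_activity(SAMPLE)
print(counts["sources offline"])    # 4
print(all_sources_offline(counts))  # True
```

On a live host, the input could come from `subprocess.run(["chronyc", "activity"], capture_output=True, text=True).stdout` (Python 3.7+), for example inside a periodic monitoring check.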


Thanks for the help!

MJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] clock skew

2019-04-26 Thread Anthony D'Atri
> @Janne: I will check out/implement the peer config per your suggestion. 
> However what confuses us is that chrony thinks the clocks match, and 
> only ceph feels it doesn't. So we are not sure if the peer config will 
> actually help in this situation. But time will tell.

Ar ar.

Chrony thinks that the clocks match *what*, though?  That each system matches 
the public pool against which it’s synced?

Something I’ve noticed, especially when using the public pool in Asia, is that 
DNS rotation causes the pool FQDNs to resolve differently multiple times a 
day, and that the quality of those servers varies considerably.  Naturally the 
ones in that pool that I set up a few years ago are spot-on, but I digress ;)

Consider this scenario:

Ceph mon A resolves the pool FQDN to serverX, which is 10ms slow
Ceph mon B resolves the pool FQDN to serverY, which is 20ms fast with lots of 
jitter

That can get you a 30ms spread right there.  This is the benefit of having the 
mons peer with each other as well as with upstream servers of varying 
stratum/quality — worst case, they will select one of their own to sync with.
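The arithmetic of this scenario as a small Python sketch: when each mon follows a different upstream, the skew Ceph sees between two mons is roughly the difference of their offsets from true time. The offsets are the illustrative 10 ms slow / 20 ms fast figures from the scenario, not measurements, and the function name is hypothetical.

```python
from itertools import combinations

def worst_pairwise_skew_ms(offsets_ms):
    """Largest |offset_a - offset_b| over all pairs of hosts.

    offsets_ms maps host name -> offset from true time in milliseconds
    (positive = fast, negative = slow).
    """
    return max(abs(a - b) for a, b in combinations(offsets_ms.values(), 2))

# Scenario from the text: mon A follows serverX (10 ms slow),
# mon B follows serverY (20 ms fast).
split_upstreams = {"monA": -10.0, "monB": 20.0}

# Counter-case: both mons are 20 ms fast. Both are "wrong",
# but they agree with each other, which is what the mons care about.
agreeing = {"monA": 20.0, "monB": 20.0}

print(worst_pairwise_skew_ms(split_upstreams))  # 30.0
print(worst_pairwise_skew_ms(agreeing))         # 0.0
```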

With iburst and modern client polling backoff, there usually isn’t much reason 
not to configure a bunch of peers.  Multiple public/vendor pool FQDNs are 
reasonable to include, but I also like to hardcode a few known-good public 
peers, even one in a different region if necessary.  Have your systems 
peer against each other too.
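A rough sketch of such a setup for chrony, written as a small Python generator so that each mon gets the same upstreams plus every other mon as a peer. The hostnames `mon1`..`mon3` and `time.example.net` are placeholders; `pool`, `server`, `peer` and `iburst` are chrony.conf directives and options, but check the chrony documentation for your version (for example regarding authentication of symmetric peers) before deploying anything like this.

```python
def chrony_conf(this_host, all_mons, pools, fixed_servers):
    """Build chrony.conf lines for one mon: upstream pools/servers plus the other mons as peers."""
    lines = []
    for pool in pools:
        # Public/vendor pool FQDNs; DNS rotation may hand each mon different pool members.
        lines.append(f"pool {pool} iburst")
    for server in fixed_servers:
        # Hard-coded known-good servers that do not depend on pool DNS rotation.
        lines.append(f"server {server} iburst")
    for mon in all_mons:
        if mon != this_host:
            # Symmetric association with every other mon, so they can also settle on one another.
            lines.append(f"peer {mon}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    mons = ["mon1", "mon2", "mon3"]
    # Hypothetical hostnames; nl.pool.ntp.org is the pool MJ switched to in this thread.
    print(chrony_conf("mon1", mons, ["nl.pool.ntp.org"], ["time.example.net"]), end="")
```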

Depending on the size of your operation, consider deploying your own on-prem 
timeserver, though antenna placement can be a hassle.

This is horribly non-enterprise, but I also suggest picking up one of these:

https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server/

It’s cheap and it can’t handle tens of thousands of clients, but it doesn’t 
have to.  Stick it in an office window and add it to your peer lists.  If you 
have a larger number of clients, have your internal NTP servers use it as a source 
(as well as each other, K-complete).  If you don’t, include it in their local 
peer constellation.  Best case you have an excellent low-stratum source for 
your systems for cheap.  Worst case you are no worse off than you were before.

Now, whether this situation is what you’re seeing I can’t say without more 
info, but it is at least plausible.




Re: [ceph-users] clock skew

2019-04-26 Thread mj

Hi all,

Thanks for all replies!

@Huang: ceph time-sync-status is exactly what I was looking for, thanks!

@Janne: I will check out/implement the peer config per your suggestion. 
However what confuses us is that chrony thinks the clocks match, and 
only ceph feels it doesn't. So we are not sure if the peer config will 
actually help in this situation. But time will tell.


@John: Thanks for the maxsources suggestion

@Bill: thanks for the interesting article, will check it out!

MJ



Re: [ceph-users] clock skew

2019-04-25 Thread Bill Sharer
If you are just syncing to the outside pool, the three hosts may end up 
latching on to different outside servers as their definitive sources. 
You might want to make one of the three a higher-priority source for the 
other two and possibly have just that one use the outside sources for sync. 
Also, for hardware newer than about five years old, you might want to 
look at enabling the NIC clocks using LinuxPTP to keep clock jitter down 
inside your LAN.  I wrote this article on the Gentoo wiki on enabling 
PTP in chrony.


https://wiki.gentoo.org/wiki/Chrony_with_hardware_timestamping
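As a rough illustration of a first step toward this: `ethtool -T <iface>` lists a NIC's timestamping capabilities, and chrony's `hwtimestamp` directive enables hardware timestamping on an interface. The Python sketch below decides from ethtool's text whether an `hwtimestamp` line makes sense; the sample output is abridged and hypothetical, and checking for the "hardware-transmit"/"hardware-receive" capability names is an assumption based on typical ethtool output. The linked article covers the actual chrony setup.

```python
def supports_hw_timestamping(ethtool_t_output):
    """True if `ethtool -T` output lists both hardware transmit and receive timestamping."""
    text = ethtool_t_output.lower()
    return "hardware-transmit" in text and "hardware-receive" in text

def chrony_hwtimestamp_line(iface, ethtool_t_output):
    """Return a chrony.conf `hwtimestamp` line for iface, or None if the NIC lacks support."""
    if supports_hw_timestamping(ethtool_t_output):
        return f"hwtimestamp {iface}"
    return None

# Hypothetical, abridged `ethtool -T eth0` output.
SAMPLE = """Time stamping parameters for eth0:
Capabilities:
\thardware-transmit     (SOF_TIMESTAMPING_TX_HARDWARE)
\tsoftware-transmit     (SOF_TIMESTAMPING_TX_SOFTWARE)
\thardware-receive      (SOF_TIMESTAMPING_RX_HARDWARE)
\tsoftware-receive      (SOF_TIMESTAMPING_RX_SOFTWARE)
\thardware-raw-clock    (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: 0
"""

print(chrony_hwtimestamp_line("eth0", SAMPLE))  # hwtimestamp eth0
```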

Bill Sharer




Re: [ceph-users] clock skew

2019-04-25 Thread John Petrini
+1 to Janne's suggestion. Also, how many time sources are you using? More
sources tend to be better, and by default chrony has a pretty low limit on the
number of sources if you're using a pool (3 or 4, I think?). You can adjust
it by adding maxsources to the pool line.

pool pool.ntp.org iburst maxsources 8
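A small Python sketch that builds the pool line suggested above and rejects out-of-range values. The constants encode our reading of the chrony.conf documentation for `pool` (maxsources defaults to 4, maximum 16); treat them as assumptions to verify for your chrony version. The function name is hypothetical.

```python
# Assumed values, from our reading of the chrony.conf documentation for `pool`.
MAXSOURCES_DEFAULT = 4
MAXSOURCES_MAX = 16

def pool_line(name, maxsources=MAXSOURCES_DEFAULT, iburst=True):
    """Build a chrony.conf `pool` directive, rejecting maxsources values outside 1..16."""
    if not 1 <= maxsources <= MAXSOURCES_MAX:
        raise ValueError(f"maxsources must be between 1 and {MAXSOURCES_MAX}, got {maxsources}")
    parts = ["pool", name]
    if iburst:
        parts.append("iburst")
    if maxsources != MAXSOURCES_DEFAULT:
        # Only spell the option out when it differs from the (assumed) default.
        parts += ["maxsources", str(maxsources)]
    return " ".join(parts)

print(pool_line("pool.ntp.org", maxsources=8))  # pool pool.ntp.org iburst maxsources 8
print(pool_line("pool.ntp.org"))                # pool pool.ntp.org iburst
```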


Re: [ceph-users] clock skew

2019-04-25 Thread Janne Johansson
On Thu, 25 Apr 2019 at 13:00, huang jun wrote:

> On Thu, 25 Apr 2019 at 6:34 PM, mj wrote:
> >
> > Hi all,
> >
> > On our three-node cluster, we have setup chrony for time sync, and even
> > though chrony reports that it is synced to ntp time, at the same time
> > ceph occasionally reports time skews that can last several hours.
> >
> > But two questions:
> >
> > - can anyone explain why this is happening, is it looks as if ceph and
> > NTP/chrony disagree on just how time-synced the servers are..?
>
> Not familiar with chrony, but in our practice we use NTP, and it works
> fine.
>

What we do with ntpd (and that is probably possible with chrony also) is to
have all mons grab the date from some generic NTP servers, but also add
each other as peers, which means they sync with each other about what time it
is, and since the mons are super close to each other network-wise, this is
very stable compared to what you might get from a random time server on
the internet. It's not super important that they are right about what time
it actually is, only that they all agree with each other.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] clock skew

2019-04-25 Thread huang jun
On Thu, 25 Apr 2019 at 6:34 PM, mj wrote:
>
> Hi all,
>
> On our three-node cluster, we have setup chrony for time sync, and even
> though chrony reports that it is synced to ntp time, at the same time
> ceph occasionally reports time skews that can last several hours.
>
> But two questions:
>
> - can anyone explain why this is happening, is it looks as if ceph and
> NTP/chrony disagree on just how time-synced the servers are..?

Not familiar with chrony, but in our practice we use NTP, and it works fine.

> - how to determine the current clock skew from cephs perspective?
> Because "ceph health detail" in case of HEALTH_OK does not show it.
> (I want to start monitoring it continuously, to see if I can find some
> sort of pattern)

You can use 'ceph time-sync-status' to get the current time sync status.
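For continuous monitoring, a minimal Python sketch that reads the JSON form of that command and reports the mon with the largest absolute skew. The assumed layout (a `time_skew_status` object keyed by mon name, each entry with `skew`, `latency` and `health`) is based on what Luminous-era clusters print, so compare it against your own `ceph time-sync-status -f json` output first; the sample values and the function name are made up for illustration.

```python
import json

def max_abs_skew(status_json):
    """Return (mon_name, skew_seconds) for the mon with the largest absolute skew."""
    status = json.loads(status_json)
    skews = {mon: float(info["skew"]) for mon, info in status["time_skew_status"].items()}
    worst = max(skews, key=lambda mon: abs(skews[mon]))
    return worst, skews[worst]

# Hypothetical sample shaped like `ceph time-sync-status -f json` output.
SAMPLE = """{
  "time_skew_status": {
    "0": {"skew": 0.0, "latency": 0.0, "health": "HEALTH_OK"},
    "1": {"skew": -0.0123, "latency": 0.00059, "health": "HEALTH_OK"},
    "2": {"skew": 0.0041, "latency": 0.00052, "health": "HEALTH_OK"}
  },
  "timechecks": {"epoch": 6, "round": 16, "round_status": "finished"}
}"""

print(max_abs_skew(SAMPLE))  # ('1', -0.0123)
```

On a live cluster, the text would come from `subprocess.run(["ceph", "time-sync-status", "-f", "json"], capture_output=True, text=True).stdout`; logging the result every few minutes gives the time series MJ wants for spotting a pattern.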
>
> Thanks!
>
> MJ



-- 
Thank you!
HuangJun


[ceph-users] clock skew

2019-04-25 Thread mj

Hi all,

On our three-node cluster, we have set up chrony for time sync, and even 
though chrony reports that it is synced to NTP time, ceph occasionally 
reports time skews that can last several hours.


See for example:


root@ceph2:~# ceph -v
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous 
(stable)
root@ceph2:~# ceph health detail
HEALTH_WARN clock skew detected on mon.1
MON_CLOCK_SKEW clock skew detected on mon.1
mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 
0.000591877s)
root@ceph2:~# chronyc tracking
Reference ID: 7F7F0101 ()
Stratum : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time : 0.00133 seconds slow of NTP time
Last offset : -0.00524 seconds
RMS offset  : 0.00524 seconds
Frequency   : 12.641 ppm slow
Residual freq   : +0.000 ppm
Skew: 0.000 ppm
Root delay  : 0.00 seconds
Root dispersion : 0.00 seconds
Update interval : 1.4 seconds
Leap status : Normal
root@ceph2:~# 
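One detail in the `chronyc tracking` output above stands out: as far as we know, chrony shows Reference ID 7F7F0101 (127.127.1.1), together with the configured local stratum (here 10), when it is serving time from its own clock under the `local` directive rather than following an upstream source. That would be consistent with the offline sources MJ reports in the 2019-04-28 update. A minimal Python sketch that flags this from the tracking text; the parser assumes the "Name : value" layout shown above, and the function names are hypothetical.

```python
# Assumption: chrony reports this reference ID (127.127.1.1) when it is
# serving time from its own clock via the `local` directive.
LOCAL_REFERENCE_ID = "7F7F0101"

def parse_tracking(text):
    """Turn `chronyc tracking` output into a dict of field name -> raw value string."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[name.strip()] = value.strip()
    return fields

def following_local_clock(fields):
    """True if chrony's reference is its own local clock rather than an upstream source."""
    ref = fields.get("Reference ID", "").split()
    return bool(ref) and ref[0].upper() == LOCAL_REFERENCE_ID

# Abridged from the ceph2 output above.
SAMPLE = """Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time     : 0.00133 seconds slow of NTP time
Leap status     : Normal
"""

fields = parse_tracking(SAMPLE)
print(fields["Stratum"])              # 10
print(following_local_clock(fields))  # True
```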


For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly 
with NTP on the two other nodes.


We don't understand this...

I have now injected mon_clock_drift_allowed 0.7, so at least we have 
HEALTH_OK again (to stop upsetting my monitoring system).


But two questions:

- Can anyone explain why this is happening? It looks as if Ceph and 
NTP/chrony disagree on just how time-synced the servers are.


- How can I determine the current clock skew from Ceph's perspective? 
"ceph health detail" does not show it when the cluster is HEALTH_OK.
(I want to start monitoring it continuously, to see if I can find some 
sort of pattern.)


Thanks!

MJ


Re: [ceph-users] Clock skew

2018-08-15 Thread Sean Crosby
Hi Dominique,

The clock skew warning shows up when your NTP daemon is not synced.

You can see the sync status in the output of ntpq -p.

This is a synced NTP daemon:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u   24   64   17    0.496   -6.421   0.181
*ntp2.unimelb.ed 202.6.131.118    2 u   26   64   17    0.613  -11.998   0.250
 ntp41.frosteri. .INIT.          16 u    -   64    0    0.000    0.000   0.000
 dns01.ntl02.pri .INIT.          16 u    -   64    0    0.000    0.000   0.000
 cosima.470n.act .INIT.          16 u    -   64    0    0.000    0.000   0.000
 x.ns.gin.ntt.ne .INIT.          16 u    -   64    0    0.000    0.000   0.000

The * shows that there is a sync with an NTP server. When you start or
restart ntp, it takes a while for a sync to occur.

Here's the output immediately after restarting the ntp daemon:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u    -   64    1    0.496   -6.421   0.000
 ntp2.unimelb.ed 202.6.131.118    2 u    -   64    1    0.474  -11.678   0.000
 ntp41.frosteri. .INIT.          16 u    -   64    0    0.000    0.000   0.000
 dns01.ntl02.pri .INIT.          16 u    -   64    0    0.000    0.000   0.000
 cosima.470n.act .INIT.          16 u    -   64    0    0.000    0.000   0.000
 x.ns.gin.ntt.ne .INIT.          16 u    -   64    0    0.000    0.000   0.000
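The check Sean describes, as a Python sketch: ntpq marks the peer that ntpd has selected for synchronization (the system peer) with a leading `*` in the tally column, so output with no such line means the daemon is not synced yet. The sample tables are abridged from the two listings above; the function name is hypothetical.

```python
def ntpq_system_peer(ntpq_output):
    """Return the remote name of the peer marked '*' in `ntpq -p` output, or None."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            # The tally code is the first character; the remote name follows it directly.
            return line[1:].split()[0]
    return None

SYNCED = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u   24   64   17    0.496   -6.421   0.181
*ntp2.unimelb.ed 202.6.131.118    2 u   26   64   17    0.613  -11.998   0.250
"""

JUST_RESTARTED = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u    -   64    1    0.496   -6.421   0.000
 ntp2.unimelb.ed 202.6.131.118    2 u    -   64    1    0.474  -11.678   0.000
"""

print(ntpq_system_peer(SYNCED))          # ntp2.unimelb.ed
print(ntpq_system_peer(JUST_RESTARTED))  # None
```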

Make sure that nothing is regularly restarting ntpd. In our case, Puppet
and DHCP regularly fought over the contents of ntp.conf, and that caused
ntpd to restart.

Sean




Re: [ceph-users] Clock skew

2018-08-15 Thread Brent Kennedy
For clock skew, I set up NTPD on one of the monitors with a public time server 
to pull from.  Then I set up NTPD on all the servers with them pulling time only 
from the local monitor server.  Restart the time service on each server until 
they get relatively close.  If you already have a time server in place, 
that would work as well.  Make sure to eliminate the backup time server entry 
as well.
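Brent's layout, where one monitor follows a public source and every other host follows only that monitor, sketched as a Python function that emits ntp.conf `server` lines per host. The host names and the public server are hypothetical placeholders; the function name is illustrative.

```python
def hub_and_spoke(hosts, hub, upstreams):
    """Map each host to ntp.conf `server` lines: the hub follows the upstreams, every other host follows only the hub."""
    plan = {}
    for host in hosts:
        sources = upstreams if host == hub else [hub]
        plan[host] = [f"server {src} iburst" for src in sources]
    return plan

# Hypothetical host names and public server.
plan = hub_and_spoke(["mon1", "mon2", "mon3", "osd1"], hub="mon1", upstreams=["0.pool.ntp.org"])
for host, lines in plan.items():
    print(host, lines)
```

The trade-off is that every host inherits the hub's errors, and the hub becomes a single point of failure for time; the mutual-peering approach discussed elsewhere in these threads spreads that risk across the mons.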

If this is already in place, then what is usually necessary is a restart of the 
monitor service on the monitor complaining of clock skew.  If any are 
virtualized, make sure the time is not syncing from the host server to the VM, 
as this could also be causing the skew.

-Brent



[ceph-users] Clock skew

2018-08-15 Thread Dominque Roux
Hi all,

We have recently been facing clock skews from time to time.
This means that sometimes everything is fine, but hours later the warning
appears again.

NTPD is running and configured with the same pool.

Has anyone else already had the same issue, and could you help us
fix this?

Thanks a lot!

Dominique


Re: [ceph-users] clock skew

2017-04-06 Thread lists

Hi Dan,


> did you mean "we have not yet..."?

Yes! That's what I meant.

Chrony does a much better job than NTP, at least here :-)

MJ


Re: [ceph-users] clock skew

2017-04-05 Thread Dan Mick

> Just to follow up on this: we have yet experienced a clock skew since we
> started using chrony. Just three days ago, I know, but still...

did you mean "we have not yet..."?

> Perhaps you should try it too, and report if it (seems to) work better
> for you as well.
> 
> But again, just three days, could be I cheer too early.



Re: [ceph-users] clock skew

2017-04-04 Thread lists

Hi John, list,

On 1-4-2017 16:18, John Petrini wrote:

> Just ntp.


Just to follow up on this: we have yet experienced a clock skew since we 
started using chrony. Just three days ago, I know, but still...


Perhaps you should try it too, and report if it (seems to) work better 
for you as well.


But again, just three days, could be I cheer too early.

MJ


Re: [ceph-users] clock skew

2017-04-01 Thread Wido den Hollander

> On 1 April 2017 at 16:02, John Petrini wrote:
> 
> 
> Hello,
> 
> I'm also curious about the impact of clock drift. We see the same on both
> of our clusters despite trying various NTP servers including our own local
> servers. Ultimately we just ended up adjusting our monitoring to be less
> sensitive to it since the clock drift always resolves on its own. Is this a
> dangerous practice?

It can prevent the Monitors from being able to elect a leader. Very large clock 
drifts can even cause cephx problems.

So yes, it can lead to downtime of your cluster.

The 50 ms of clock drift it allows by default should be enough, and NTP should 
keep the drift below that.

Wido



Re: [ceph-users] clock skew

2017-04-01 Thread John Petrini
Just ntp.


Re: [ceph-users] clock skew

2017-04-01 Thread mj



On 04/01/2017 04:02 PM, John Petrini wrote:

> Hello,
>
> I'm also curious about the impact of clock drift. We see the same on
> both of our clusters despite trying various NTP servers including our
> own local servers. Ultimately we just ended up adjusting our monitoring
> to be less sensitive to it since the clock drift always resolves on its
> own. Is this a dangerous practice?


Are you running ntp, or chrony (which I did not know of)?


Re: [ceph-users] clock skew

2017-04-01 Thread John Petrini
Hello,

I'm also curious about the impact of clock drift. We see the same on both
of our clusters despite trying various NTP servers including our own local
servers. Ultimately we just ended up adjusting our monitoring to be less
sensitive to it since the clock drift always resolves on its own. Is this a
dangerous practice?

___

John Petrini

NOC Systems Administrator   //   *CoreDial, LLC*   //   coredial.com

Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422
*P: *215.297.4400 x232   //   *F: *215.297.4401   //   *E: *
jpetr...@coredial.com




Re: [ceph-users] clock skew

2017-04-01 Thread mj

Hi,

On 04/01/2017 02:10 PM, Wido den Hollander wrote:

> You could try the chrony NTP daemon instead of ntpd and make sure all
> MONs are peers from each other.

I understand now what that means. I have set it up according to your 
suggestion.


Curious to see how this works out, thanks!

MJ


Re: [ceph-users] clock skew

2017-04-01 Thread mj

Hi Wido,

On 04/01/2017 02:10 PM, Wido den Hollander wrote:

> That warning is there for a reason. I suggest you double-check your NTP and 
> clocks on the machines. This should never happen in production.

I know... I don't understand why this happens! I tried both ntpd and 
systemd-timesyncd. I did not know chrony yet; I will try it.


I imagined that a 0.2 sec time skew would not be too disastrous. As a 
side note: I cannot find an explanation anywhere of WHAT could happen if the 
skew becomes too big, only that we should prevent it. (Data loss?)



> Are you running the MONs inside Virtual Machines? They are more likely to have 
> drifting clocks.

Nope. All bare metal on new Supermicro servers.


> You could try the chrony NTP daemon instead of ntpd and make sure all MONs are 
> peers from each other.

Will try that.

I had set all MONs to sync with chime1.surfnet.nl - chime4. We usually 
have very good experiences with those NTP servers.


So, you're telling me that the MONs should be peers of each other... 
But if all MONs listen/sync to/with each other, where do I configure the 
external stratum 1 source?


MJ


Re: [ceph-users] clock skew

2017-04-01 Thread Wido den Hollander

> On 1 April 2017 at 11:17, mj wrote:
> 
> 
> Hi,
> 
> Despite ntp, we keep getting clock skews that auto disappear again after 
> a few minutes.
> 
> To prevent the unnecessary HEALTH_WARNs, I have increased the 
> mon_clock_drift_allowed to 0.2, as can be seen below:
> 

That warning is there for a reason. I suggest you double-check your NTP and 
clocks on the machines. This should never happen in production.

Are you running the MONs inside Virtual Machines? They are more likely to have 
drifting clocks.

You could try the chrony NTP daemon instead of ntpd and make sure all MONs are 
peers from each other.

Wido

> > root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show 
> > | grep clock
> > "mon_clock_drift_allowed": "0.2",
> > "mon_clock_drift_warn_backoff": "5",
> > "clock_offset": "0",
> > root@ceph1:~#
> 
> Despite this setting, I keep receiving HEALTH_WARNs like below:
> 
> > ceph cluster node ceph1 health status became HEALTH_WARN clock skew 
> > detected on mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0 
> > clock skew 0.113709s > max 0.1s (latency 0.000523111s)
> 
> Can anyone explain why the running config shows 
> "mon_clock_drift_allowed": "0.2" and the HEALTH_WARN says "max 0.1s 
> (latency 0.000523111s)"?
> 
> How come there's a difference between the two?
> 
> MJ


Re: [ceph-users] clock skew

2017-04-01 Thread mj

Hi!

On 04/01/2017 12:49 PM, Wei Jin wrote:

> mon_clock_drift_allowed should be used in monitor process, what's the
> output of `ceph daemon mon.foo config show | grep clock`?
>
> how did you change the value? command line or config file?


I guess I changed it wrong then... I did it in ceph.conf, like:

[global]
mon clock drift allowed = 0.1

and for immediate effect, also:
> ceph tell osd.* injectargs --mon_clock_drift_allowed "0.2"

So I guess that's wrong?

Should it be under the [mon] section of ceph.conf?

If it's listed under [global] like I have it now, then what does it 
actually change?
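A simplified Python model of why the [global] value still matters here: a Ceph daemon takes a setting from the most specific ceph.conf section that defines it, checking `[mon.<id>]`, then `[mon]`, then `[global]`, and treats spaces, dashes and underscores in option names as equivalent. So `mon clock drift allowed = 0.1` under [global] reaches the monitors (matching the "max 0.1s" in the warning), while `ceph tell osd.* injectargs` only changed the running OSDs. This sketch models the lookup order with configparser; it is not Ceph's actual parser, and the function names are hypothetical.

```python
import configparser

def normalize(option):
    """Treat 'mon clock drift allowed', 'mon-clock-drift-allowed' and 'mon_clock_drift_allowed' alike."""
    return option.strip().lower().replace(" ", "_").replace("-", "_")

def lookup(conf_text, daemon, option):
    """Return the value a daemon such as 'mon.1' would read from ceph.conf, or None.

    Simplified model: search [<type>.<id>], then [<type>], then [global].
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    daemon_type = daemon.split(".", 1)[0]
    wanted = normalize(option)
    for section in (daemon, daemon_type, "global"):
        if parser.has_section(section):
            for key, value in parser.items(section):
                if normalize(key) == wanted:
                    return value
    return None

CONF = """
[global]
mon clock drift allowed = 0.1
"""

print(lookup(CONF, "mon.1", "mon_clock_drift_allowed"))  # 0.1
print(lookup(CONF + "[mon]\nmon clock drift allowed = 0.2\n", "mon.1", "mon_clock_drift_allowed"))  # 0.2
```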



Re: [ceph-users] clock skew

2017-04-01 Thread Wei Jin
On Sat, Apr 1, 2017 at 5:17 PM, mj  wrote:
> Hi,
>
> Despite ntp, we keep getting clock skews that auto disappear again after a
> few minutes.
>
> To prevent the unnecessary HEALTH_WARNs, I have increased the
> mon_clock_drift_allowed to 0.2, as can be seen below:
>
>> root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config
>> show | grep clock
>> "mon_clock_drift_allowed": "0.2",
>> "mon_clock_drift_warn_backoff": "5",
>> "clock_offset": "0",
>> root@ceph1:~#

mon_clock_drift_allowed is used by the monitor process; what's the
output of `ceph daemon mon.foo config show | grep clock`?

How did you change the value: command line or config file?

>
>
> Despite this setting, I keep receiving HEALTH_WARNs like below:
>
>> ceph cluster node ceph1 health status became HEALTH_WARN clock skew
>> detected on mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0
>> clock skew 0.113709s > max 0.1s (latency 0.000523111s)
>
>
> Can anyone explain why the running config shows "mon_clock_drift_allowed":
> "0.2" and the HEALTH_WARN says "max 0.1s (latency 0.000523111s)"?
>
> How come there's a difference between the two?
>
> MJ


[ceph-users] clock skew

2017-04-01 Thread mj

Hi,

Despite ntp, we keep getting clock skews that disappear again on their own 
after a few minutes.


To prevent the unnecessary HEALTH_WARNs, I have increased 
mon_clock_drift_allowed to 0.2, as can be seen below:



root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep clock
"mon_clock_drift_allowed": "0.2",
"mon_clock_drift_warn_backoff": "5",
"clock_offset": "0",
root@ceph1:~#


Despite this setting, I keep receiving HEALTH_WARNs like below:


ceph cluster node ceph1 health status became HEALTH_WARN clock skew detected on 
mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0 clock skew 
0.113709s > max 0.1s (latency 0.000523111s)


Can anyone explain why the running config shows 
"mon_clock_drift_allowed": "0.2" and the HEALTH_WARN says "max 0.1s 
(latency 0.000523111s)"?


How come there's a difference between the two?
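To compare the threshold a monitor is actually enforcing with what a config dump says, the numbers can be pulled straight out of the warning text. A Python sketch, assuming the message wording shown above (it may differ between Ceph releases); the regex and function name are illustrative.

```python
import re

SKEW_RE = re.compile(
    r"(?P<mon>mon\.\S+) addr \S+ clock skew (?P<skew>[\d.]+)s > max (?P<max>[\d.]+)s"
    r" \(latency (?P<latency>[\d.]+)s\)"
)

def parse_skew_warning(text):
    """Extract mon name, measured skew, enforced max and latency (seconds) from a clock-skew warning."""
    m = SKEW_RE.search(" ".join(text.split()))  # collapse line wraps and repeated spaces
    if m is None:
        return None
    return {
        "mon": m.group("mon"),
        "skew": float(m.group("skew")),
        "max": float(m.group("max")),
        "latency": float(m.group("latency")),
    }

WARNING = """ceph cluster node ceph1 health status became HEALTH_WARN clock skew detected on
mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0 clock skew
0.113709s > max 0.1s (latency 0.000523111s)"""

print(parse_skew_warning(WARNING))
```

Here the parsed `max` is 0.1 s, which is what the monitor enforces, regardless of the 0.2 shown in the OSD's admin-socket dump.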

MJ


[ceph-users] clock skew detected

2015-06-10 Thread Pavel V. Kaygorodov
Hi!

Immediately after a reboot of the mon.3 host, its clock was unsynchronized and a
"clock skew detected on mon.3" warning appeared.
But now (more than 1 hour of uptime) the clock is synced, yet the warning is still
showing.
Is this OK?
Or do I have to restart the monitor after clock synchronization?

Pavel.



Re: [ceph-users] clock skew detected

2015-06-10 Thread Andrey Korolyov
On Wed, Jun 10, 2015 at 4:11 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote:
> Hi!
>
> Immediately after a reboot of the mon.3 host, its clock was unsynchronized and a
> "clock skew detected on mon.3" warning appeared.
> But now (more than 1 hour of uptime) the clock is synced, yet the warning is
> still showing.
> Is this OK?
> Or do I have to restart the monitor after clock synchronization?
>
> Pavel.



The quorum should report OK after a five-minute interval, but there is
a bug which prevents the quorum from doing so, at least on the oldest
supported stable versions of Ceph. I've never reported it because of
its almost zero importance, but things are what they are: in theory the
behavior should be different, and the warning should disappear
without a restart.


Re: [ceph-users] clock skew

2014-03-13 Thread Joao Eduardo Luis

On 03/12/2014 05:04 PM, John Nielsen wrote:

On Mar 12, 2014, at 10:44 AM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:


2014-01-30 18:41 GMT+01:00 Eric Eastman eri...@aol.com:

I have this problem on some of my Ceph clusters, and I think it is due to
the older hardware the I am using does not have the best clocks.  To fix the
problem, I setup one server in my lab to be my local NTP time server, and
then on each of my Ceph monitors, in the /etc/ntp.conf file, I put in a
single server line that reads:

   server XX.XX.XX.XX iburst burst minpoll 4 maxpoll 5


I'm using a local NTP server, all Mons are synced with local NTP but
ceph still detect a clock skew


Machine clocks aren't perfect, even with NTP. Ceph by default is very 
sensitive. I usually add this to my ceph.conf to prevent the warnings:

[mon]
   mon clock drift allowed = .500

That is, allow the clocks to drift up to 1/2 second before saying anything.



Having this as a tunable option is meant precisely to allow one to find
the best value.  The current default of .05 was increased from an
earlier .01 just because our lab's NTP server wasn't able to keep the
clocks that tightly synchronized.


However, these warnings are meant to act as an early warning system for
the monitor.  There are some critical messages that need to be passed,
and some timeouts that need to be reset in time.  Failure to do so
results in weirdness.  And unlike the OSDs, the monitors do rely on real
time, hence the need for synchronized server clocks; failure to keep
those clocks synchronized for some time may eventually have
repercussions: monitors receiving timestamps somewhat in the past and
thus ignoring them, or timeouts being triggered too soon or too late
because a message wasn't duly received.


Anyway, most timeouts will hold for 5 seconds.  Allowing clock drifts of up
to 1 second may work, but we don't have hard data to support such a claim.
Over a second of drift may be problematic if the monitors are under
some workload and message handling is delayed, in which case other
timeouts may have to be adjusted to account not only for the clock skew
but also for the amount of work the monitor has to deal with.
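To make the budget argument concrete, here is a deliberately simplified sketch, assuming clock skew and message-handling delay simply add up against a 5-second timeout. This is an illustration only, not how the monitor code computes anything; `timeout_margin` is a hypothetical helper and the delay values are made up for the example:

```python
# Deliberately simplified model (an assumption, not Ceph's actual logic):
# skew and message-handling delay both eat into a fixed timeout.
TIMEOUT_S = 5.0  # "most timeouts will hold for 5 seconds"

def timeout_margin(timeout_s: float, skew_s: float, delay_s: float) -> float:
    """Seconds left before a timeout fires under the additive model."""
    return timeout_s - abs(skew_s) - delay_s

# Skews from the thread (.05 default, .5 suggested, 1 s upper bound) combined
# with a hypothetical idle (0 s) and busy (3 s) message-handling delay.
for skew in (0.05, 0.5, 1.0):
    for delay in (0.0, 3.0):
        print(f"skew={skew:.2f}s delay={delay:.1f}s "
              f"margin={timeout_margin(TIMEOUT_S, skew, delay):.2f}s")
```

Even in this crude model, a 1 s drift on a busy monitor leaves only a fraction of the timeout, which is the situation where other timeouts would need adjusting too.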



  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


Re: [ceph-users] clock skew

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 12:59 GMT+01:00 Joao Eduardo Luis joao.l...@inktank.com:
> Anyway, most timeouts will hold for 5 seconds.  Allowing clock drifts up to
> 1 second may work, but we don't have hard data to support such a claim.  Over
> a second of drift may be problematic if the monitors are under some workload
> and message handling is delayed -- in which case other timeouts may have to
> be adjusted, not only to account for the clock skew but the amount of work
> the monitor has to deal with.

I think that 1 second is too much.
I would like to try .100 or .200, not whole seconds.


Re: [ceph-users] clock skew

2014-03-13 Thread Stijn De Weirdt

Can we retest the clock skew condition? Or get the actual value of the skew?

ceph status gives
 health HEALTH_WARN clock skew detected on mon.ceph003

in a polysh session (i.e. a parallel-ssh sort of thing)
ready (3) date +%s.%N
ceph002 : 1394713567.184218678
ceph003 : 1394713567.182722045
ceph001 : 1394713567.185351320

(they are ptp synced)
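The spread across these readings can be computed directly. A minimal sketch using Python's `decimal` module to keep all nanosecond digits; note that commands launched through a parallel shell do not run at exactly the same instant, so the result mixes real clock offset with launch jitter. `clock_spread` and `offsets_from` are hypothetical helper names:

```python
from decimal import Decimal
from typing import Dict

# `date +%s.%N` readings from the polysh session above, given as strings so
# that Decimal keeps all nine fractional digits.
samples: Dict[str, Decimal] = {
    "ceph002": Decimal("1394713567.184218678"),
    "ceph003": Decimal("1394713567.182722045"),
    "ceph001": Decimal("1394713567.185351320"),
}

def clock_spread(readings: Dict[str, Decimal]) -> Decimal:
    """Largest difference between any two hosts' readings, in seconds."""
    return max(readings.values()) - min(readings.values())

def offsets_from(reference: str, readings: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Each host's reading minus the reference host's reading, in seconds."""
    base = readings[reference]
    return {host: t - base for host, t in readings.items()}

print(clock_spread(samples))  # 0.002629275
print(offsets_from("ceph002", samples))
```

On these three readings the spread is about 2.6 ms, well under the .05 s default that Joao mentions in this thread.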


stijn

On 03/13/2014 01:19 PM, Gandalf Corvotempesta wrote:
>
> 2014-03-13 12:59 GMT+01:00 Joao Eduardo Luis joao.l...@inktank.com:
>
>> Anyway, most timeouts will hold for 5 seconds.  Allowing clock drifts up to
>> 1 second may work, but we don't have hard data to support such a claim.  Over
>> a second of drift may be problematic if the monitors are under some workload
>> and message handling is delayed -- in which case other timeouts may have to
>> be adjusted, not only to account for the clock skew but the amount of work
>> the monitor has to deal with.
>
> I think that 1 second is too much.
> I would like to try .100 or .200, not whole seconds.


Re: [ceph-users] clock skew

2014-03-13 Thread Joao Eduardo Luis

On 03/13/2014 12:30 PM, Stijn De Weirdt wrote:

> Can we retest the clock skew condition? Or get the actual value of the skew?


'ceph health detail --format=json-pretty' (for instance, but 'json' or 
'xml' is also allowed) will give you information on a per-monitor basis 
of both skew and latency as perceived by the monitors.
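A sketch of filtering that output per monitor. The JSON layout is an assumption here (a list of entries under `timechecks.mons`, each with `name`, `skew` and `latency`); check it against what your release actually prints before relying on it. `load_report` and `skewed_monitors` are hypothetical helpers:

```python
import json
from typing import Any, Dict, List, Tuple

def load_report(text: str) -> Dict[str, Any]:
    """Parse the text printed with --format=json or --format=json-pretty."""
    return json.loads(text)

def skewed_monitors(report: Dict[str, Any], allowed: float) -> List[Tuple[str, float, float]]:
    """Return (name, skew, latency) for each monitor whose |skew| exceeds `allowed`.

    Assumed (hypothetical) layout: report["timechecks"]["mons"] is a list of
    {"name": ..., "skew": ..., "latency": ...}. Values go through float()
    in case they are emitted as strings rather than numbers.
    """
    flagged = []
    for mon in report.get("timechecks", {}).get("mons", []):
        skew = float(mon["skew"])
        if abs(skew) > allowed:
            flagged.append((mon["name"], skew, float(mon["latency"])))
    return flagged
```

For example, `skewed_monitors(load_report(output), allowed=0.05)` would list the monitors beyond the .05 s default, where `output` holds the command's text.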


  -Joao






--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


Re: [ceph-users] clock skew

2014-03-12 Thread Gandalf Corvotempesta
2014-01-30 18:41 GMT+01:00 Eric Eastman eri...@aol.com:
> I have this problem on some of my Ceph clusters, and I think it is because
> the older hardware I am using does not have the best clocks.  To fix the
> problem, I set up one server in my lab to be my local NTP time server, and
> then on each of my Ceph monitors, in the /etc/ntp.conf file, I put in a
> single server line that reads:
>
> server XX.XX.XX.XX iburst burst minpoll 4 maxpoll 5

I'm using a local NTP server, all Mons are synced with local NTP but
ceph still detects a clock skew


Re: [ceph-users] clock skew

2014-03-12 Thread John Nielsen
On Mar 12, 2014, at 10:44 AM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:

> 2014-01-30 18:41 GMT+01:00 Eric Eastman eri...@aol.com:
>> I have this problem on some of my Ceph clusters, and I think it is because
>> the older hardware I am using does not have the best clocks.  To fix the
>> problem, I set up one server in my lab to be my local NTP time server, and
>> then on each of my Ceph monitors, in the /etc/ntp.conf file, I put in a
>> single server line that reads:
>>
>>   server XX.XX.XX.XX iburst burst minpoll 4 maxpoll 5
>>
> I'm using a local NTP server, all Mons are synced with local NTP but
> ceph still detects a clock skew

Machine clocks aren't perfect, even with NTP. Ceph by default is very 
sensitive. I usually add this to my ceph.conf to prevent the warnings:

[mon]
  mon clock drift allowed = .500

That is, allow the clocks to drift up to 1/2 second before saying anything.

JN



[ceph-users] clock skew

2014-01-30 Thread Gandalf Corvotempesta
Hi.
I'm using ntpd on each ceph server and it is syncing properly, but every
time I reboot, ceph starts in degraded mode with a clock skew
warning.

The only way I have found to solve this is to manually restart ceph on
each node (without resyncing the clock).

Any suggestions?


Re: [ceph-users] clock skew

2014-01-30 Thread Emmanuel Lacour
On Thu, Jan 30, 2014 at 12:53:22PM +0100, Gandalf Corvotempesta wrote:
> Hi.
> I'm using ntpd on each ceph server and it is syncing properly, but every
> time I reboot, ceph starts in degraded mode with a clock skew
> warning.
>
> The only way I have found to solve this is to manually restart ceph on
> each node (without resyncing the clock).
>
> Any suggestions?



Here, I just wait until the skew is gone, without touching ceph. It
doesn't seem to do anything bad...

-- 
Easter-eggs  GNU/Linux Specialist
44-46 rue de l'Ouest  -  75014 Paris  -  France -  Métro Gaité
Phone: +33 (0) 1 43 35 00 37-   Fax: +33 (0) 1 43 35 00 76
mailto:elac...@easter-eggs.com  -   http://www.easter-eggs.com


Re: [ceph-users] clock skew

2014-01-30 Thread Markus Goldberg

You can run 'ntpdate -b timeserver-ip';
read the ntpdate manual for the parameters.

Markus
On 30.01.2014 16:05, Emmanuel Lacour wrote:

> On Thu, Jan 30, 2014 at 12:53:22PM +0100, Gandalf Corvotempesta wrote:
>
>> Hi.
>> I'm using ntpd on each ceph server and it is syncing properly, but every
>> time I reboot, ceph starts in degraded mode with a clock skew
>> warning.
>>
>> The only way I have found to solve this is to manually restart ceph on
>> each node (without resyncing the clock).
>>
>> Any suggestions?
>
> Here, I just wait until the skew is gone, without touching ceph. It
> doesn't seem to do anything bad...




--
Best regards,
  Markus Goldberg

--
Markus Goldberg   Universität Hildesheim
  Computing Center
Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany
Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
--




Re: [ceph-users] clock skew

2014-01-30 Thread Gandalf Corvotempesta
2014-01-30 Emmanuel Lacour elac...@easter-eggs.com:
> Here, I just wait until the skew is gone, without touching ceph. It
> doesn't seem to do anything bad...

I've waited more than 1 hour with no success.


Re: [ceph-users] clock skew

2014-01-30 Thread Eric Eastman
I have this problem on some of my Ceph clusters, and I think it is because
the older hardware I am using does not have the best clocks.  To
fix the problem, I set up one server in my lab to be my local NTP time
server, and then on each of my Ceph monitors, in the /etc/ntp.conf
file, I put in a single server line that reads:


   server XX.XX.XX.XX iburst burst minpoll 4 maxpoll 5

Where XX.XX.XX.XX is my local NTP time server IP address

Within a few minutes after a reboot, all my monitor clocks sync.

Per the NTP docs, do not use these settings when pointing to an
external NTP server, as they generate a lot more network traffic than
the default NTP settings.
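NTP poll values are powers of two in seconds, which is where the extra traffic comes from. A small worked example, assuming ntpd's documented defaults of `minpoll 6` and `maxpoll 10`; `poll_interval` and `polls_per_hour` are hypothetical helper names:

```python
# An NTP poll exponent p means a poll interval of 2**p seconds.
# ntpd's documented defaults are minpoll 6 (64 s) and maxpoll 10 (1024 s).
DEFAULT_MINPOLL, DEFAULT_MAXPOLL = 6, 10

def poll_interval(exponent: int) -> int:
    """Seconds between polls for a given poll exponent."""
    return 2 ** exponent

def polls_per_hour(exponent: int) -> float:
    """Polls sent to one server per hour at a fixed poll exponent."""
    return 3600 / poll_interval(exponent)

for label, lo, hi in (("default", DEFAULT_MINPOLL, DEFAULT_MAXPOLL),
                      ("minpoll 4 maxpoll 5", 4, 5)):
    print(f"{label}: every {poll_interval(lo)}-{poll_interval(hi)} s, "
          f"{polls_per_hour(hi):.1f} polls/hour once settled at maxpoll")
```

At steady state the tuned line polls every 32 s instead of up to every 1024 s, i.e. 32 times as often, before counting the extra packets that `burst` sends per poll.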


Eric

