Le mardi 07 novembre 2017 à 13:41 +0100, Dejan Muhamedagic a écrit : > Hi, > > On Mon, Nov 06, 2017 at 10:52:12AM +0100, Nicolas Huillard wrote: > > Hello, > > > > I have many of those above syslog messages from boothd (counting > > all servers, that's nearly 1 hundred per day). > > All sites are synchronized using NTP, but according to source > > (https://github.com/ClusterLabs/booth/blob/master/src/transport.c), > > that specific messages isn't event tied to maxtimeskew (which I > > forced to 120s, because I wondered it it wrongly defaulted to 0s). > > This message is output unless the new timestamp is larger (>) that > > the previous one from the same site). > > Right. > > > Adding debug on the arbitrator, it appears that every exchange > > between the servers is made of at least 2 messages in each > > direction. Could it be that two consecutive messages have the exact > > same timestamp, and thus trigger the warning? > > The time resolution used should be sufficiently fine (i.e. > microseconds) so that the timestamps of two consecutive packets > are not the same. At least I'd expect that to be so. Now, what is > the actual time resolution on your platform? Maybe you should > check the clocksource? Still, if these are not some special kind > of computing platform, I suppose that there shouldn't be any > surprises.
My servers are regular Intel platforms ("Intel(R) Atom(TM) CPU C2750 @ 2.40GHz" and "Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz"), using Debian stretch with kernel 4.9.51. According to boot-time logs, TSC is selected as the clocksource: kernel: [ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns kernel: [ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns kernel: [ 1.512907] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns kernel: [ 1.905780] clocksource: Switched to clocksource hpet kernel: [ 1.932222] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns kernel: [ 3.861842] tsc: Refined TSC clocksource calibration: 2099.998 MHz kernel: [ 3.861859] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e452ea631d, max_idle_ns: 440795244572 ns kernel: [ 5.345808] clocksource: Switched to clocksource tsc https://superuser.com/questions/393969/what-does-clocksource-tsc-unstab le-mean#393978 "In short, on modern systems, the TSC sucks for measuring time accurately" and "There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized" Oops: this means that the timestamps sent may not be increasing, depending on which core boothd runs... The CPUs are currently underutilised, which can lead to increased discrepancy between cores TSCs. I can either: * switch to another clocksource (I don't yet know how to do that) * lock boothd on a specific core (I don't know if I can do that) * ignore these messages altogether (the next one re. "timestamp older than skew" will still happen) > How often are these messages logged, compared to the expire (or > renewal_freq) time? 3 nodes, 5 min expire, 2 tickets = 1728 messages per day, vs. 108 messages on all host in the last 24h = 6% fail. Apparently, the Xeon D fail more often than the Atom C. I wonder why this problem is not more widely experienced (not that it's a big problem). > Thanks, > > Dejan Thanks for your work! -- Nicolas Huillard _______________________________________________ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org