[LEAPSECS] Leapseconds, more evidence
I'm told that this is the power usage from one of Hetzner's datacenters over the leap-second:

http://imgur.com/a/ykoup

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
LEAPSECS mailing list
LEAPSECS@leapsecond.com
http://six.pairlist.net/mailman/listinfo/leapsecs
Re: [LEAPSECS] Leapseconds, more evidence
Interesting. A few questions spring to mind. Let me preface them by stating that this is far from my area of expertise and I'd be delighted to be educated here.

1) What are the units on the y-axis?

2) This shows the surge continuing for at least half a day after the event with no significant downward jumps from servers being rebooted, fixes being applied (as described in previous emails), or other apparent attempts at addressing the situation. Was nobody monitoring the data center? Weren't alerts sent out to offsite staff? Did they attempt to implement any fixes?

2a) Do we know if any of the datacenter customers had prepared their workflows/systems in advance of the leap second? That is, do we know whether the left hand side might be lower than normal?

3) There is a shallow decreasing trend after the event. Is this what one would expect from affected servers left on their own? Do the problems resolve themselves eventually, or is operator interaction required?

3a) What happened after this plot was made? Perhaps staff finally arrived and gave it a kick? Did the power ramp back down to normal? Judging from the trend in the plot, they will otherwise return to normal around midday tomorrow.

4) The upward jump is very rapid. Hard to tell from the scale, but it appears to be a facility-wide 15% jump in, say, one minute or less. Is this supportable by the power infrastructure onsite or on the local grid (assuming it is on the grid)? Is the plot smoothed in any way?

5) This is a complex figure-of-merit and only distantly related to server processing load. Might one expect the power usage to exhibit ringing behaviors from the rapid jump? What is the typical mix of power consumption in a data center between CPUs, disks, cooling, etc.? Wouldn't these each have different time constants that a naive viewer (i.e., me) might think would be visible?

6) The small scale structure remains similar before and after the event. Wouldn't one expect the system loading to interact in some complex way with the actual workflows the data center is in business to serve? One might think the small scale would become either more variable or perhaps even flatten out as CPUs pegged. As the CPUs pegged, wouldn't disk I/O decrease (or at least change in some fashion)?

7) This is a 24 hour plot. I guess the hourly peaks would be some sort of housekeeping workflows, and there surely would be load-balancing to squeeze the most out of the datacenter - but is it normal to otherwise see no diurnal variation? And if there is load-balancing, is that the 2-3 day ramp we're seeing after the event? One would think the time-constant would be much more rapid.

8) Is a typical day flat (outside small scale structure) at just about 91 units? Is there usually a difference between a Saturday night and a Sunday morning?

Provenance would be appreciated. Whatever our positions on the issues, they'll be strengthened by something more reliable than "I'm told that". For instance, what's the mix of host OSes - is there otherwise reason to believe the datacenter was a candidate for the various issues described to date?

I don't suppose this particular datacenter is used by Amadeus? :-)

Rob
--

On Jul 2, 2012, at 11:53 AM, Poul-Henning Kamp wrote:

> I'm told that this is the power usage from one of Hetzner's data centers over the leap-second:
> http://imgur.com/a/ykoup
Re: [LEAPSECS] Leapseconds, more evidence
In message eeebea0e-71cf-490d-98e5-3d01ab131...@noao.edu, Rob Seaman writes:

> Interesting. A few questions spring to mind. Let me preface them by stating that this is far from my area of expertise and I'd be delighted to be educated here.

The original source may be this:

https://plus.google.com/117024231055768477646/posts/2pkWbDiEDQG

> 1) What are the units on the y-axis?

Watt.

> 4) The upward jump is very rapid. Hard to tell from the scale, but it appears to be a facility-wide 15% jump in, say, one minute or less. Is this supportable by the power infrastructure onsite or on the local grid (assuming it is on the grid)? Is the plot smoothed in any way?

It's only 135 kW, not really a big deal for a data-center, much less of a deal for the power-grid.

> I don't suppose this particular datacenter is used by Amadeus? :-)

Well, Kristian Köhntopp works for booking.com...

--
Poul-Henning Kamp
Re: [LEAPSECS] Leapseconds, more evidence
Rob Seaman wrote:

> 7) This is a 24 hour plot. I guess the hourly peaks would be some sort of housekeeping workflows,

Hourly cron jobs. Most people are idiots and schedule them all for the top of the hour. The graph also shows discernible regular peaks for aligned cycles of 30, 15, 10, and 5 minutes. Of course, for the same reason, daily cron jobs are most often scheduled for midnight, so it's expected that there be an especially big peak at midnight. I'd like to see a normal midnight's graph as well, because that's a confounding effect that complicates interpretation of this graph.

The downward trend seen in the hours following midnight rather resembles the smaller downward trend seen within each hour. Particularly obvious in the last three hours before UT midnight, but visible in nearly every hour, there's a long-lived jump at the top of each hour, at the same time as (but distinct from) the short-lived hourly peak. Is this the normal structure at that scale?

-zefram
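A minimal sketch of why aligned schedules pile up at the top of the hour. It assumes, purely for illustration, exactly one job for each of the periods mentioned above (5, 10, 15, 30 and 60 minutes), all anchored at minute 0. The function name and the job mix are hypothetical and say nothing about what actually runs in the datacenter in question.

```python
from collections import Counter


def firings_per_minute(periods_min, span_min=60):
    """Count how many jobs fire at each minute of the hour.

    Every job is assumed to start at phase 0, i.e. it fires at
    minutes 0, period, 2*period, ... within the span.
    """
    counts = Counter()
    for period in periods_min:
        for minute in range(0, span_min, period):
            counts[minute] += 1
    return counts


# Hypothetical mix: one job per period (an assumption, not measured data).
periods = [5, 10, 15, 30, 60]
counts = firings_per_minute(periods)

# Crude text histogram: one '#' per job firing at that minute.
for minute in sorted(counts):
    print(f"{minute:02d} {'#' * counts[minute]}")
```

With this mix the tallest bar is at minute 00, the next at minute 30, and the 5-minute marks carry only a single job, which is the same ordering of peaks described for the graph.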
Re: [LEAPSECS] Leapseconds, more evidence
On Jul 2, 2012, at 1:47 PM, Poul-Henning Kamp wrote:

> The original source may be this:
> https://plus.google.com/117024231055768477646/posts/2pkWbDiEDQG

You'd think google+ would let you run google translate…

>> 1) What are the units on the y-axis?
>
> Watt.
> …
> It's only 135 kW, not really a big deal for a data-center, much less of a deal for the power-grid.

Ok - from the Hetzner web page I was imagining a much larger operation. The other questions about the waveform remain in play.

Thanks!

Rob
Re: [LEAPSECS] Leapseconds, more evidence
Zefram wrote:

> Hourly cron jobs. Most people are idiots and schedule them all for the top of the hour. The graph also shows discernible regular peaks for aligned cycles of 30, 15, 10, and 5 minutes. Of course, for the same reason, daily cron jobs are most often scheduled for midnight, so it's expected that there be an especially big peak at midnight.

Good point. Though at an observatory the midnight issue often translates to noon.

Interesting notion about dithering crontab and init scheduling in general. Grabbing a crontab from a random server here - one that likely several folks have added to - many entries are on even cycles, but there are some interesting choices like 47 minutes after the hour, or activities in the middle of the afternoon.

I was taken by the 17 minute cycle quoted for one of the issues in several reports (though perhaps borrowing from the same source). Anybody know what that was about? Something that was programmed in, or emergent behavior like locusts?

> I'd like to see a normal midnight's graph as well, because that's a confounding effect that complicates interpretation of this graph.

Yes!

> The downward trend seen in the hours following midnight rather resembles the smaller downward trend seen within each hour. Particularly obvious in the last three hours before UT midnight, but visible in nearly every hour, there's a long-lived jump at the top of each hour, at the same time as (but distinct from) the short-lived hourly peak. Is this the normal structure at that scale?

The version on google+ is a bit more readable. Not sure how much analysis is useful without more context.

Rob
Re: [LEAPSECS] Leapseconds, more evidence
Rob Seaman wrote:

> but there are some interesting choices like 47 minutes after the hour,

I make a point of randomising the phase of all my cron jobs. Cron makes it easy to schedule a job for the same time every hour or every day, so I tend to use hourly or daily cycles but with a static random selection of phase. E.g., my job to check for a new version of the Olson timezone database runs at 7 minutes past each hour, because 7 is what came out of

    echo $((RANDOM%60))

(zsh) when I set it up. Another technique I've seen is to put a random sleep first thing in the job's commands, so that it's dithered per run.

> I was taken by the 17 minute cycle quoted for one of the issues in several reports (though perhaps borrowing from the same source). Anybody know what that was about?

That's ntpd. It uses power-of-two numbers of seconds for the pauses between polls of each peer, and so tends to update tracking parameters on that cycle. (It doesn't maintain a strict cycle time; you get an approximate cycle dominated by that pause time.) It goes up to 1024 s, which is 17 min + 4 s.

    ntpq -c pe

shows the selected pause time for each peer in the "poll" column.

-zefram
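The two dithering techniques and the poll-interval arithmetic above can be sketched as follows. This is only an illustration: the function names, the 300 s maximum delay and the example command path are hypothetical, and the lower bound of 2**6 s is an assumption about ntpd's usual default (the message above only states the 1024 s upper end).

```python
import random


def hourly_crontab_line(command, rng=random):
    """Static random phase: pick the minute once, when the job is set up.

    Same idea as `echo $((RANDOM%60))`; the chosen minute then stays fixed.
    """
    minute = rng.randrange(60)
    return f"{minute} * * * * {command}"


def per_run_delay_seconds(max_delay=300, rng=random):
    """Per-run dither: a fresh delay the job would sleep before doing work.

    max_delay is a hypothetical value chosen for illustration.
    """
    return rng.uniform(0, max_delay)


def ntpd_poll_intervals(min_exp=6, max_exp=10):
    """Power-of-two poll intervals as (seconds, whole minutes, leftover seconds).

    min_exp=6 is an assumed default; max_exp=10 gives the 1024 s quoted above.
    """
    return [(2 ** n, *divmod(2 ** n, 60)) for n in range(min_exp, max_exp + 1)]


# Hypothetical command path, for illustration only.
print(hourly_crontab_line("/usr/local/bin/check-tzdata"))
for seconds, minutes, rest in ntpd_poll_intervals():
    print(f"{seconds:5d} s = {minutes:2d} min + {rest:2d} s")
```

The static phase keeps a job's timing predictable for whoever maintains it, while the per-run delay spreads load even among hosts that share an identical crontab.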
[LEAPSECS] Good description of Linux kernel bugs related to leap seconds
From Steven Bellovin, on NANOG:

> See http://landslidecoding.blogspot.com/2012/07/linuxs-leap-second-deadlocks.html

I've been hoping somebody would post a good summary like that.

It covers 5 different bugs. I'd split them into 3 clumps.

The first clump is kernel deadlocks. As often happens with hard problems, fixing it here breaks it over there. That's the first 3 bugs.

The 4th bug explains the CPU load spikes. A kernel bug broke futexes. They are typically used in spin-lock-like loops from user code. I don't understand the details. Setting the time (to any value) fixes the kernel problem so everything starts to work normally again. (No need to reboot.)

The 5th bug is still a mystery.

Also from NANOG:

> If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow.

I agree. This is a good excuse to find my copy and read it again. Yes, it's very good, but not directly related to leap seconds.

--
These are my opinions. I hate spam.
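As a rough sketch of the "set the time to any value" remedy described above, the snippet below sets the clock to the value it already holds. It assumes a Unix system and a process with the privilege to set the clock; the function name and the injectable set_clock parameter are hypothetical, added only so the logic can be tried without touching a real clock. Whether this clears the futex problem on a given kernel is taken from the message above, not verified here.

```python
import time


def reset_clock_to_now(set_clock=None):
    """Set the system clock to its own current reading.

    set_clock is a hypothetical hook for dry runs; by default the real
    clock is set via time.clock_settime (Unix only, needs privileges).
    """
    now = time.time()
    if set_clock is None:
        def set_clock(t):
            time.clock_settime(time.CLOCK_REALTIME, t)
    set_clock(now)
    return now


# Dry run: record what would be written instead of setting the clock.
recorded = []
value = reset_clock_to_now(recorded.append)
print("would set CLOCK_REALTIME to", value)
```

Calling reset_clock_to_now() with no argument performs the real clock set and raises PermissionError when run without sufficient privileges.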
[LEAPSECS] the next future of utc
In the aftermath of the 2012 leap second we announce the next meeting

http://futureofutc.org/

Requirements for UTC and Civil Timekeeping on Earth
A Colloquium Addressing a Continuous Time Standard

to be held at the University of Virginia, Charlottesville, VA
May 29-31, 2013.

--
Steve Allen                 s...@ucolick.org                WGS-84 (GPS)
UCO/Lick Observatory--ISB   Natural Sciences II, Room 165   Lat  +36.99855
1156 High Street            Voice: +1 831 459 3046          Lng -122.06015
Santa Cruz, CA 95064        http://www.ucolick.org/~sla/    Hgt +250 m