In article <4a173189$0$18246$da0fe...@news.zen.co.uk>,
 Andy Yates <andyy1...@gmail.com> writes:

>Hi Hal
>
>It's up to us to specify what we think the SLA should be - the guide
>is "as accurate as possible"!

I think there is an implied "at reasonable cost" in there.

I've never run a data center nor had to hassle with SLAs.

If my boss gave me that task, I'd push back real hard.  Where
is the knee of the benefit curve?  Is 100 ms good enough?
What fraction of the time?  How much more is 10 ms worth?

There are two types of costs.  One is hardware and easy to see.
The other is operations.  If you spec things too tight, you will
create a lot of work for the operations team.

If you do anything sane, the clocks will be within 10-100 ms
most of the time.  Is that good enough?  What sort of "most"
do you need?

Do you have legal requirements?  (as in stock market transactions)
What does your lawyer say?

If you are going to put an SLA for time into a contract, you
will have to have a way to verify that you are meeting specs,
so you might as well start debugging the monitoring process
now.  If you are sufficiently paranoid, you will need (at
least) 2 monitors in each data center.

You will also need a time-wizard to keep track of things.
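As a sketch of what that monitoring might look like: `ntpq -c rv`
reports the local clock's `offset` in milliseconds, which you can
check against whatever limit ends up in the SLA.  (The 10 ms
threshold here is a made-up example, not a recommendation.)

```python
import re
import subprocess

def offset_ms(rv_output: str) -> float:
    """Extract the clock offset (milliseconds) from `ntpq -c rv` output."""
    m = re.search(r"\boffset=(-?[\d.]+)", rv_output)
    if m is None:
        raise ValueError("no offset= field in ntpq output")
    return float(m.group(1))

def check_sla(limit_ms: float = 10.0) -> bool:
    """Return True if the local clock is within limit_ms of its sources."""
    out = subprocess.run(["ntpq", "-c", "rv"], capture_output=True,
                         text=True, check=True).stdout
    return abs(offset_ms(out)) <= limit_ms
```

Run something like that from cron on each monitor box and log the
result; a threshold check you never exercise is not a monitoring
process.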


>> How stable is your temperature?  (Both the room and the CPU load.)
>
>Temperature will be very stable, the DC is very well specified and
>scrupulously engineered - no cables blocking air flow etc. Generally
>speaking the CPU is over-specified.

Does anybody ever hold the door open for more than a few seconds?
Can you be sure they won't do it tomorrow?

That's only half the problem.  The other is the source of heat
inside the box.  An active system makes a lot more heat than
an idle one.  To get numbers, I'd set up a system, turn on lots of
logging, leave it idle for a long time (say a day) then look at
the drift.  (It's in loopstats.)  Then start a good load, let it
run for several hours, and see how much the drift changed.
I'd also look at the offset during the transient.

(PS: If your specs are tight, you will have to repeat that
experiment each time you get a new flavor of server box.
It's just another item for the checklist.)
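A sketch of comparing the two runs: each loopstats line is
MJD, seconds-past-midnight, offset (s), frequency (drift, ppm),
jitter, wander, and usually a poll interval.  The file paths and
sample window below are assumptions for illustration.

```python
def mean_frequency(loopstats_lines, last_n=100):
    """Average the frequency (drift, ppm) column over the last n samples.

    Each loopstats line is: MJD seconds offset frequency jitter wander [poll].
    """
    freqs = [float(line.split()[3]) for line in loopstats_lines if line.strip()]
    tail = freqs[-last_n:]
    return sum(tail) / len(tail)

# Compare the idle run against the loaded run (hypothetical paths):
# idle = mean_frequency(open("/var/log/ntpstats/loopstats.idle"))
# loaded = mean_frequency(open("/var/log/ntpstats/loopstats.loaded"))
# print("drift changed by", loaded - idle, "ppm")
```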


>> What is the load on the LAN between the clients and servers?
>> (Delay is not a problem.  Variation in delay is a problem.)
>NTP will be on a separate management LAN from the production traffic,
>so not subject to the variances that application load has on the network.

That seems like a reasonable assumption.  Are you sure?  Will
it ever get used for an emergency transfer of a large file?
(say recovering from a crashed disk)


There are a handful of things I can think of that will screw up
your clocks.
  temperature
  network load
  software bugs
  operational screwups
  driver quirks

Linux has a history of screwing up the timekeeping kernel code.

Operators can be very ingenious at finding ways to screw things up.
If your time spec is tight enough, you will have to go over the
checklist carefully with time in mind.  You'll need to add things
like "wait x minutes for the system to warm up" when you swap in
a new box for one that died.

Ethernet drivers often batch interrupts (interrupt coalescing) to
reduce CPU overhead, which adds jitter to packet timestamps.
Details matter.  Another item for the checklist.
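On Linux you can audit that with `ethtool -c <iface>`.  A hedged
sketch of parsing its "name: value" output (the interface name and
the exact fields present vary by driver):

```python
import subprocess

def coalesce_settings(text: str) -> dict:
    """Parse `ethtool -c` output ("name: value" lines) into a dict."""
    settings = {}
    for line in text.splitlines():
        # Skip the header line, which ends with a bare colon.
        if ":" in line and not line.endswith(":"):
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    return settings

# e.g. (hypothetical interface name):
# out = subprocess.run(["ethtool", "-c", "eth0"], capture_output=True,
#                      text=True).stdout
# print(coalesce_settings(out).get("rx-usecs"))
```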

-- 
These are my opinions, not necessarily my employer's.  I hate spam.

_______________________________________________
questions mailing list
questions@lists.ntp.org
https://lists.ntp.org/mailman/listinfo/questions