Fundamentally, you want to define an SLA and then demonstrate that you
are meeting it (or how close you are to meeting it, with improvement
over time).  The problem is: how do you define an SLA?

1.  90% of all tickets will be closed in 3 days (measure the number of
tickets that are older than 3 days)
2.  VPN and remote access services up 99.99% of the time (measure
uptime outside of scheduled maintenance windows)
3.  new users have accounts/machines/etc. within n days of their start
date (preferably n = -1, i.e., ready the day before they arrive)
4.  IMAP latency below n milliseconds (measure how long it takes to do
a simulated login, read of 100 messages, and log out)
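As a sketch of the first kind of measurement (the data and function
name here are hypothetical; your ticket system's query interface will
differ):

```python
from datetime import datetime, timedelta

def sla_compliance(open_times, now, max_age_days=3):
    """Fraction of currently open tickets still within the SLA window,
    i.e., opened less than max_age_days ago."""
    if not open_times:
        return 1.0
    cutoff = now - timedelta(days=max_age_days)
    within = sum(1 for opened in open_times if opened >= cutoff)
    return within / len(open_times)

# Hypothetical sample data: when each open ticket was filed.
now = datetime(2024, 1, 10)
open_times = [datetime(2024, 1, 9),   # 1 day old: within SLA
              datetime(2024, 1, 8),   # 2 days old: within SLA
              datetime(2024, 1, 2)]   # 8 days old: violates SLA
print(f"{sla_compliance(open_times, now):.0%} of tickets within SLA")
```

Run it from cron, compare the result against the 90% target, and you
have an automatic yes/no answer every day.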

I prefer measuring things that can be measured automatically; all of
the above can be.  Asking humans to take manual measurements is a
burden and error-prone.

I recently started a new assignment where I was supposed to write down
the number of open tickets at the beginning and end of the day, and
keep count of how many tickets I had completed.  Oh brother.  As you
can imagine, I failed.  There wasn't a single day on which I collected
all three data points.  Eventually I found a script that could do it
for me, and now the entire team uses it (cron generates the report).
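A script of that sort can be very small.  This is a hypothetical
sketch (the queue data and the count logic stand in for whatever query
your ticket system actually supports, e.g. an RT search or an SQL
query); cron would run it at the start and end of each day:

```python
#!/usr/bin/env python3
"""Daily ticket report, intended to be run from cron.
Hypothetical: real code would query the ticket system instead of
taking an in-memory list."""
from datetime import date

def report(queue, today):
    """Summarize open tickets and tickets resolved today."""
    open_now = sum(1 for t in queue if t["status"] == "open")
    resolved_today = sum(1 for t in queue
                         if t["status"] == "resolved"
                         and t["resolved_on"] == today)
    return f"{today}: open={open_now} resolved_today={resolved_today}"

# Hypothetical sample data.
queue = [
    {"status": "open"},
    {"status": "open"},
    {"status": "resolved", "resolved_on": date(2024, 1, 10)},
]
print(report(queue, date(2024, 1, 10)))
```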

Some things that can't be automatically measured:

1.  Customer happiness.  Yes, you can send out surveys, but that
rarely works: people don't respond to surveys unless they are OCD or
very angry.  Rather than a survey, it is better to give people a way
to tell a manager that they were unhappy, so that the team can be
"educated".  Sometimes it helps to disguise that in the form of a
survey: ask people to rate the service from 1 to 5 and include a big
comment box for them to fill out.  You can ignore the rating (or graph
it for your boss... if he/she likes graphs).  Pay attention to the
comments.  You'll only get them when someone is angry and needs to be
heard.

2.  "Return to service".  When a disk dies (or a router, etc.), how
long does it take before the service is operational again?  Don't
measure this; measure the SLA as above.  If I were measured on my
"return to service" times, I'd stop building systems with RAID or
redundant routers, so that I'd have a lot of outages and tons of data
to show how good I am.

Lastly, penalize people who beat their SLA.  This is controversial,
but hear me out.  If the SLA says we'll have 99.9% uptime and I
provide 99.999% uptime, that means I'm wasting money on more
redundancy than is needed, or avoiding important system upgrades (and
therefore impeding innovation).  If I am hovering around 99.9%, +/-
0.1%, then I've demonstrated that I can balance uptime against budget
and innovation.  If management complains about outages but I'm still
at 99.9%, then they need to tell me, in writing, what they want to do:
give me more money or slow down the rate of upgrades.  They may back
down, or they may choose one of those options.  Either is fine.  If
you think about it, the essential role of management is to set goals
and provide resources to meet those goals.  By working to hit (not
exceed) your SLA, you create an environment where they can perform
that essential role, whether they realize it or not.
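The gap between those uptime levels is easy to quantify as a downtime
budget; a quick back-of-the-envelope calculation:

```python
# Allowed downtime per year implied by an uptime SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years

def downtime_budget_minutes(uptime_pct):
    """Minutes of downtime per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for sla in (99.9, 99.99, 99.999):
    print(f"{sla}%: {downtime_budget_minutes(sla):.1f} min/year")
```

99.9% allows roughly 8.8 hours of downtime a year; 99.999% allows
about 5 minutes.  The money and caution needed to close that gap is
exactly what you're wasting if 99.9% is all anyone asked for.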

Tom
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
