Fundamentally, you want to define an SLA and then demonstrate that you are meeting it (or how close you are to meeting it, with improvement over time). The problem is: how do you define an SLA? Some examples:

1. 90% of all tickets will be closed in 3 days (measure the number of tickets that are older than 3 days)
2. VPN and remote access services up 99.99% of the time (measure uptime outside of scheduled maintenance windows)
3. New users have accounts/machines/etc. within n days of their start (preferably n=-1)
4. IMAP latency below n microseconds (measure how long it takes to do a simulated login, read of 100 messages, and log out)

I prefer measuring things that can be measured automatically. All of the above can be. Asking humans to take manual measurements is a burden and error-prone.

I recently started a new assignment where I was supposed to write down the number of open tickets at the beginning and end of the day, and keep count of how many tickets I had completed. Oh brother. As you can imagine, I failed. There wasn't a single day on which I had collected all three data points. Eventually I found a script that could do that for me and now the entire team uses it (cron generates the report).

Some things that can't be automatically measured:

1. Customer happiness. Yes, you can send out surveys, but that rarely works. People don't respond to surveys unless they are OCD or very angry. Rather than a survey, it is better to give people a way to tell a manager that they were unhappy so that the team can be "educated". Sometimes it helps to disguise that in the form of a survey. Just ask people to rate the results from 1 to 5 and put in a big comment box for them to fill out. You can ignore the rating (or graph it for your boss... if he/she likes graphs). Pay attention to the comments. You'll only get them when someone is angry and needs to be heard.
2. "Return to service". When there is a dead disk (or dead router, etc.), how long was it before you were able to return the service to operation? Don't measure this. Measure the SLA like above.
If I were measured on my "return to service" times, I'd stop building systems with RAID or redundant routers so that I could have a lot of outages and tons of data to show how good I am.

Lastly, penalize people that beat their SLA. This is controversial, but hear me out. If the SLA says we'll have 99.9% uptime, and I provide 99.999% uptime, that means I'm wasting money on more redundancy than is needed or avoiding important system upgrades (and therefore impeding innovation). If I am hovering around 99.9% by +/- 0.1%, then I've demonstrated that I can balance uptime with budget and innovation.

If management complains about outages but I'm still at 99.9%, then they need to tell me (in writing) what they want to do: give me more money or slow down the rate of upgrades. They may back down or they may choose one of the other options. That's fine.

If you think about it, the essential role of management is to set goals and provide resources to meet those goals. By working to hit (not exceed) your SLA, you are creating an environment where they can perform their essential role, whether they realize it or not.

Tom
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
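
To make two of the measurements above concrete, here is a minimal Python sketch. It is not the script mentioned in the post. The function names, the (opened, closed) ticket tuples, and the 30-day reporting period are all assumptions made for illustration; a real version would pull the timestamps from whatever ticket system is in use. The first function computes the share of tickets that met a "closed in 3 days" target, and the second turns an uptime percentage into the amount of downtime it permits.

```python
from datetime import datetime, timedelta


def ticket_sla_compliance(tickets, now, limit=timedelta(days=3)):
    """Return the fraction of tickets that met the closure SLA.

    tickets: iterable of (opened, closed) datetimes; closed is None
             while the ticket is still open (hypothetical data shape).
    Tickets that are still open but younger than `limit` are skipped,
    because they have neither met nor missed the SLA yet.
    """
    met = missed = 0
    for opened, closed in tickets:
        if closed is not None:
            if closed - opened <= limit:
                met += 1
            else:
                missed += 1
        elif now - opened > limit:
            # Still open and already past the limit: a miss.
            missed += 1
    decided = met + missed
    return met / decided if decided else 1.0


def allowed_downtime(target_percent, period=timedelta(days=30)):
    """Return the downtime a given uptime target permits over `period`."""
    return period * (1 - target_percent / 100)


if __name__ == "__main__":
    now = datetime(2024, 1, 10)
    tickets = [
        (datetime(2024, 1, 1), datetime(2024, 1, 2)),  # closed in 1 day
        (datetime(2024, 1, 1), datetime(2024, 1, 6)),  # closed in 5 days
        (datetime(2024, 1, 2), None),                  # open for 8 days
        (datetime(2024, 1, 9), None),                  # open for 1 day
    ]
    print(f"ticket SLA compliance: {ticket_sla_compliance(tickets, now):.0%}")
    for target in (99.9, 99.999):
        print(f"{target}% uptime allows {allowed_downtime(target)} per 30 days")
```

Over 30 days, 99.9% uptime leaves roughly 43 minutes of downtime to spend on upgrades and failures, while 99.999% leaves about 26 seconds. That gap is the budget the post argues you should be using rather than hoarding.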
