Alexander Box wrote:

> In the document Brendan makes use of %b in iostat, which he/the co-authors
> also recommend doing in Solaris Performance and Tools. Given that this was
> discouraged on Wednesday, is there a more accurate equivalent available?

Unfortunately I wasn't there on Wednesday, so I don't know what was said or
what the concern about %b in iostat is. I would argue that no single number
is useful by itself, but taken as a collection they are.

Computer system performance can often usefully be thought of as a network
of queues, through which work flows. A request arrives at one queue, waits
a while, gets serviced, then moves on to the next queue (or perhaps departs
the system completely).

We tend to characterise queues by things like arrival rate, queue length,
service time, queueing discipline, etc. There's a well-known relationship,
known as Little's Law: L = AW, also expressed as Q = lambda R. This relates
queue length [L,Q] to arrival rate [A,lambda] and response time [W,R].
Little's Law can be applied at any scale, so long as you're consistent
about what's inside the box and what's outside the box. One key observation
is that if you have two of the three pieces of information, you can derive
the third.
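
As a minimal sketch of how that works (the numbers below are made up purely
for illustration):

    # Little's Law: L = A * W, i.e. queue length = arrival rate * response time.
    arrival_rate = 400.0    # requests arriving per second (assumed)
    queue_length = 2.0      # average requests in the queue (assumed)

    # Knowing two of the three quantities, derive the third:
    response_time = queue_length / arrival_rate   # W = L / A = 0.005s, i.e. 5ms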

Because modern disks, and especially volumes presented by hardware RAID
arrays, have complicated internal structure, %b is much less useful as a
predictor of performance than it was when we only dealt with JBOD. It might
take only one stream of i/o for a device to be considered 100% busy, yet the
volume may have multiple internal disks capable of supporting many
concurrent i/os.
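
To make that concrete with some invented numbers: suppose each internal
spindle of a RAID volume takes about 5ms per i/o and the volume has 8
spindles. A small sketch, assuming those figures:

    service_time = 0.005    # 5ms per i/o on one internal spindle (assumed)
    spindles = 8            # internal disks inside the volume (assumed)

    # A single synchronous stream keeps exactly one i/o outstanding, so the
    # volume appears busy essentially all the time (%b ~ 100%), yet delivers only:
    single_stream_iops = 1 / service_time       # 200 i/os per second

    # With enough concurrency to keep every spindle working, the same
    # "100% busy" volume could sustain roughly:
    concurrent_iops = spindles / service_time   # 1600 i/os per second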

So these days I tend to focus on response times AND average queue lengths
instead. Again, it's just a question of where you "draw the box" when
interpreting some of these metrics. The response time of a disk at a low
level in the system can become a key component of a service time at a
higher level.

Disk devices generally have their own internal queues to support having
multiple i/os simultaneously "active" [as viewed from outside the device],
and Tagged Command Queueing is used to match responses with their requests.
A "service time" of 20ms may not be considered bad if on average the 
internal
disk queue has a length of 6. However 20ms may be unacceptable for a 
latency-
sensitive application.
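
Applying Little's Law to those illustrative figures:

    queue_length = 6          # average i/os outstanding inside the device (assumed)
    response_time = 0.020     # 20ms average per i/o (assumed)

    throughput = queue_length / response_time   # A = L / W = 300 i/os per second

    # The device is doing plenty of aggregate work, but each individual
    # request still waits 20ms, which is what a latency-sensitive
    # application sees.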

Some queues may be limited in the number of requests they can hold. Disk
devices can be interrogated to find out what that limit is for them, so
that device drivers don't try to send more requests at a time than they
can cope with. The excess requests have to be held somewhere, so the device
driver provides an additional queue, and time spent in that "wait queue"
has to be considered as well.
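
A very rough sketch of that two-level arrangement (the queue limit and the
names here are invented for illustration, not taken from any real driver):

    import collections

    MAX_TAGS = 32                          # assumed limit the device will accept
    device_queue = collections.deque()     # i/os "active" inside the device
    wait_queue = collections.deque()       # driver-side overflow queue

    def submit(request):
        # The driver issues at most MAX_TAGS requests to the device;
        # anything beyond that sits in the driver's wait queue.
        if len(device_queue) < MAX_TAGS:
            device_queue.append(request)
        else:
            wait_queue.append(request)

    def complete(request):
        # Completions may arrive out of order (that's what the tags are for),
        # so remove the specific request, then promote a waiting one if any.
        device_queue.remove(request)
        if wait_queue:
            device_queue.append(wait_queue.popleft())

Time a request spends in wait_queue corresponds to driver-level wait time;
time spent in device_queue is the device's own service time.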

In practice, what this means is that there's quite a lot of information in
iostat output, as it describes two or three layers of queues and associated
"servers". Each layer has its own concept of queue length, wait time, and
service time. %b is just one piece of a much bigger puzzle.
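
For example, with iostat -xn the wait and actv columns are the average
lengths of the driver's wait queue and the device's active queue, wsvc_t and
asvc_t are the average times spent in each, and %w and %b show how often
each of those queues had something in it.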

-- 
============================================================================
   ,-_|\   Richard Smith Staff Engineer PAE
  /     \  Sun Microsystems                   Phone : +61 3 9869 6200
richard.smith at Sun.COM                        Direct : +61 3 9869 6224
  \_,-._/  476 St Kilda Road                    Fax : +61 3 9869 6290
       v   Melbourne Vic 3004 Australia
=========================================================================== 
