You say: """I have traditional Unix-type load-average and the percentage of how "idle" and "busy" the web-server is. But is that enough info? Or is that too much? How much data should the front-end want or need? Maybe a single agreed-upon value (ala "load average") is best... maybe not. These are the kinds of questions to answer."""
How are the 'idle' and 'busy' measures being calculated?

Now to deviate a bit into a related topic... One of the concerns I have had when looking over how the MPMs work of late is that the measure of how many threads are busy, used to determine whether processes should be created or destroyed, is a spot measure. At least that is how I interpret the code and I could well be wrong, so please correct me if I am :-) That is, only the number of threads in use at the moment the maintenance cycle runs is taken into consideration.

In the Python world, where one cannot preload the Python interpreter or your application in the Apache parent for various reasons, and so must defer it to the child worker processes, recycling processes can be an expensive exercise, as everything is done in the child after the fork. What worries me is that the current MPM calculation, using a spot measure, isn't really a true indication of how much the server is being utilised over time.

Imagine the worst case, where you were under load with a large number of concurrent requests and a commensurate number of processes, but a substantial number of requests finished just before the maintenance cycle ran. The spot measure could yield quite a low number which doesn't truly reflect the request load on the server in the period just before that, and what may therefore come after. As a result of a low number for a specific maintenance cycle, it could think it had more idle threads than needed and kill off a process. On the next cycle one second later, the maintenance check may instead hit a high number of concurrent requests and think it has to create a process again.

Another case is where you had a momentary network issue and so requests were not getting through; for that short period the busy measure would be low and processes would progressively get killed off at a rate of one a second.
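To make the concern concrete, here is a minimal Python sketch (the numbers and names are mine, purely illustrative, not taken from any MPM code) contrasting a spot reading with an average over the interval since the last maintenance cycle:

```python
# Hypothetical busy-thread samples taken once per second over a
# 5-second maintenance interval. The load was high for most of the
# interval, but dipped just before the maintenance cycle ran.
samples = [18, 19, 17, 18, 2]  # busy threads, sampled each second

spot = samples[-1]                      # what a spot measure sees: 2
windowed = sum(samples) / len(samples)  # average over the interval: 14.8

# A decision based on `spot` would conclude the server is nearly idle
# and start killing processes, even though the windowed average shows
# it was heavily loaded for almost the entire period.
print(spot, windowed)
```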
Using a spot measure rather than looking at busyness over an extended window of time, especially when killing processes, could cause process recycling when it isn't warranted, or when it would be better that it simply didn't happen. The potential for this is in part avoided by what the min/max idle threads are set to. That is, those settings effectively smooth out small fluctuations, but because the busy measure is a spot metric, I am still concerned that the randomness of when requests run means the spot metric could still jump around quite a lot between maintenance cycles, to the extent that it could exceed the min/max levels and so kill off processes.

Now, for a Python site where recycling processes is expensive, the solution is to reconfigure the MPM settings to start more servers at the outset and allow a lot more idle capacity. But we know how many people actually bother to tune these settings properly.

Anyway, that has had me wondering, and is why I ask how you are calculating 'idle' and 'busy', whether such busy measures should not perhaps be done differently, so that they can look back at prior traffic during the period since the last maintenance cycle, or even beyond that. One way of doing this is a measure I call thread utilisation, or what some also refer to as instance busy. At this point it is going to be easier for me to refer to:

http://blog.newrelic.com/2012/09/11/introducing-capacity-analysis-for-python/

which has some nice pictures and a description to help explain this thread utilisation measure. The thread utilisation over the time since the last maintenance cycle could then be used, perhaps weighted in some way with the current spot busy value and also with prior time periods, to better smooth the value being used in the decision.

I am guessing that some systems do have something more elaborate than the simplistic mechanism that the MPMs appear to use, by my reading of the code. So what, for example, does mod_fcgid do?
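A rough sketch of how such a thread utilisation measure could be accumulated (this is my own illustration, not code from mod_wsgi, New Relic, or any MPM; all names are hypothetical): each completed request contributes its busy time, and utilisation for the interval is busy-thread-seconds divided by thread-capacity-seconds, with an exponentially weighted blend to smooth across maintenance cycles:

```python
class ThreadUtilisation:
    """Accumulates busy-thread-seconds between maintenance cycles.

    Hypothetical sketch: request handlers call record_request(), and
    the maintenance cycle calls utilisation() to get the fraction of
    total thread capacity actually used since the last cycle.
    """

    def __init__(self, capacity, start=0.0):
        self.capacity = capacity       # total worker threads available
        self.busy_seconds = 0.0        # busy time accumulated this interval
        self.interval_start = start

    def record_request(self, start, end):
        # Simplification: credit the whole request duration to this
        # interval; a real implementation would clip requests that
        # straddle an interval boundary.
        self.busy_seconds += end - start

    def utilisation(self, now):
        elapsed = now - self.interval_start
        value = self.busy_seconds / (self.capacity * elapsed)
        # Reset the accumulator for the next interval.
        self.busy_seconds = 0.0
        self.interval_start = now
        return value


def smooth(previous, current, alpha=0.5):
    # Blend the latest windowed value with history so one anomalous
    # interval cannot by itself trigger process recycling.
    return alpha * current + (1 - alpha) * previous
```

With 10 threads and two half-second requests in a one-second interval, utilisation comes out at 0.1 rather than whatever the busy count happened to be at the instant maintenance ran.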
Even using thread utilisation, one thing it cannot capture is queueing time. That is, how long was a request sitting in the listener queue waiting to be accepted? Unfortunately, I don't know of any way to calculate this directly from the operating system, so it generally relies on some front end sticking a header with a time stamp into the request, with the backend server looking at the elapsed time when the request hits it. If the front end and back end are on different machines, though, then you have issues of clock skew to deal with.

Anyway, sorry for the long ramble. I guess I am just curious how busy is being calculated. Are there better ways of calculating what busy is which are more accurate? Or does it mostly not matter, because when you start to reach higher levels of utilisation the spot metric will tend towards becoming more reflective of actual utilisation? Can additional measures, if they can be derived, such as queueing time, help get a better picture of what is going on?

BTW, I have thought it would be quite interesting if the calculation done in the respective MPMs for deciding whether processes should be started or killed could be factored out into a modular concept. That way I could more easily play with, or provide, an alternate algorithm which may be more appropriate for a different use case, without having to modify code. But then this flexibility may well be unwarranted and what is there may be good enough.

Graham

On 13 November 2012 02:04, Jim Jagielski <j...@jagunet.com> wrote:
> Booting the discussion:
>
> http://www.jimjag.com/imo/index.php?/archives/248-The-Case-for-a-Universal-Web-Server-Load-Value.html
>
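P.S. To illustrate the front-end timestamp approach for queueing time mentioned above, here is a rough sketch (the header name and "t=<epoch seconds>" format are an assumption on my part, and clock skew between machines is simply clamped away rather than handled):

```python
import time

def queue_time_seconds(headers, now=None):
    """Derive listener/queue wait time from a front-end timestamp header.

    Assumes the front end set something like:
        X-Request-Start: t=1352700000.123456
    (seconds since the epoch). Returns 0.0 if the header is absent,
    malformed, or in the future -- the latter can happen with clock
    skew between the front-end and back-end machines.
    """
    raw = headers.get('X-Request-Start', '')
    if not raw.startswith('t='):
        return 0.0
    try:
        start = float(raw[2:])
    except ValueError:
        return 0.0
    now = time.time() if now is None else now
    return max(0.0, now - start)
```

The clamp to zero is the crude part: a proper treatment of skewed clocks would need something smarter than just discarding negative values.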