You say: """I have traditional Unix-type load-average and the percentage of how "idle" and "busy" the web-server is. But is that enough info? Or is that too much? How much data should the front-end want or need? Maybe a single agreed-upon value (ala "load average") is best... maybe not. These are the kinds of questions to answer."""
How are the 'idle' and 'busy' measures being calculated?

Now to deviate a bit into a related topic... One of the concerns I have had when looking over how the MPMs work of late is that the measure of how many threads are busy, used to determine whether processes should be created or destroyed, is a spot measure. At least that is how I interpret the code and I could well be wrong, so please correct me if I am :-) That is, only the number of threads in use at the moment the maintenance cycle runs is taken into consideration.

In the Python world, where one cannot preload the Python interpreter or your application in the Apache parent for various reasons, and so must defer it to the child worker processes, recycling processes can be an expensive exercise, as everything is done in the child after the fork. What worries me is that the current MPM calculation, using a spot measure, isn't really a true indication of how much the server is being utilised over time.

Imagine the worst case, where you were under load with a large number of concurrent requests and a commensurate number of processes, but a substantial number of requests finished just before the maintenance cycle ran. The spot measure could yield quite a low number which doesn't truly reflect the request load on the server in the period just before that, and what may therefore come after. As a result of a low number for a specific maintenance cycle, it could think it had more idle threads than needed and kill off a process. On the next cycle one second later, the maintenance check may instead hit a high number of concurrent requests and think it has to create a process again.

Another case is where you had a momentary network issue and so requests were not getting through; for that short period the busy measure would be low and processes would progressively get killed off at a rate of one a second.
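To make the concern concrete, here is a minimal Python sketch (the numbers and names are mine, purely illustrative, not taken from any MPM code) contrasting a spot reading with an average over the interval since the last maintenance cycle:

```python
# Hypothetical busy-thread samples taken once per second over a
# 5-second maintenance interval. The load was high for most of the
# interval, but dipped just before the maintenance cycle ran.
samples = [18, 19, 17, 18, 2]  # busy threads, sampled each second

spot = samples[-1]                      # what a spot measure sees: 2
windowed = sum(samples) / len(samples)  # average over the interval: 14.8

# A decision based on `spot` would conclude the server is nearly idle
# and start killing processes, even though the windowed average shows
# it was heavily loaded for almost the entire period.
print(spot, windowed)
```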
Using a spot measure rather than looking at busyness over an extended window of time, especially when killing processes, could cause process recycling when it isn't warranted, or when it would be better that it simply didn't happen. The potential for this is in part avoided by what the min/max idle threads are set to. That is, those settings effectively smooth out small fluctuations, but because the busy measure is a spot metric, I am still concerned that the randomness of when requests run means the spot metric could still jump around quite a lot between maintenance cycles, to the extent that it could exceed the min/max levels and so kill off processes.

Now, for a Python site where recycling processes is expensive, the solution is to reconfigure the MPM settings to start more servers at the outset and allow a lot more idle capacity. But we know how many people actually bother to tune these settings properly.

Anyway, that has had me wondering, and is why I ask how you are calculating 'idle' and 'busy', whether such busy measures should not perhaps be done differently, so that they can look back at prior traffic during the period since the last maintenance cycle, or even beyond that. One way of doing this is a measure I call thread utilisation, or what some also refer to as instance busy. At this point it is going to be easier for me to refer to:

http://blog.newrelic.com/2012/09/11/introducing-capacity-analysis-for-python/

which has some nice pictures and a description to help explain this thread utilisation measure. The thread utilisation over the time since the last maintenance cycle could then be used, perhaps weighted in some way with the current spot busy value and also with prior time periods, to better smooth the value being used in the decision.

I am guessing that some systems do have something more elaborate than the simplistic mechanism that the MPMs appear to use, by my reading of the code. So what, for example, does mod_fcgid do?
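A rough sketch of how such a thread utilisation measure could be accumulated (this is my own illustration, not code from mod_wsgi, New Relic, or any MPM; all names are hypothetical): each completed request contributes its busy time, and utilisation for the interval is busy-thread-seconds divided by thread-capacity-seconds, with an exponentially weighted blend to smooth across maintenance cycles:

```python
class ThreadUtilisation:
    """Accumulates busy-thread-seconds between maintenance cycles.

    Hypothetical sketch: request handlers call record_request(), and
    the maintenance cycle calls utilisation() to get the fraction of
    total thread capacity actually used since the last cycle.
    """

    def __init__(self, capacity, start=0.0):
        self.capacity = capacity       # total worker threads available
        self.busy_seconds = 0.0        # busy time accumulated this interval
        self.interval_start = start

    def record_request(self, start, end):
        # Simplification: credit the whole request duration to this
        # interval; a real implementation would clip requests that
        # straddle an interval boundary.
        self.busy_seconds += end - start

    def utilisation(self, now):
        elapsed = now - self.interval_start
        value = self.busy_seconds / (self.capacity * elapsed)
        # Reset the accumulator for the next interval.
        self.busy_seconds = 0.0
        self.interval_start = now
        return value


def smooth(previous, current, alpha=0.5):
    # Blend the latest windowed value with history so one anomalous
    # interval cannot by itself trigger process recycling.
    return alpha * current + (1 - alpha) * previous
```

With 10 threads and two half-second requests in a one-second interval, utilisation comes out at 0.1 rather than whatever the busy count happened to be at the instant maintenance ran.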
Even using thread utilisation, one thing it cannot capture is queueing time. That is, how long was a request sitting in the listener queue waiting to be accepted? Unfortunately, I don't know of any way to calculate this directly from the operating system, so it generally relies on some front end sticking a header with a time stamp into the request, with the backend server looking at the elapsed time when the request hits it. If the front end and back end are on different machines, though, then you have issues of clock skew to deal with.

Anyway, sorry for the long ramble. I guess I am just curious how busy is being calculated. Are there better ways of calculating what busy is which are more accurate? Or does it mostly not matter, because when you start to reach higher levels of utilisation the spot metric will tend towards becoming more reflective of actual utilisation? Can additional measures, if they can be derived, such as queueing time, help get a better picture of what is going on?

BTW, I have thought it would be quite interesting if the calculation done in the respective MPMs for deciding whether processes should be started or killed could be factored out into a modular concept. That way I could more easily play with, or provide, an alternate algorithm which may be more appropriate for a different use case, without having to modify code. But then this flexibility may well be unwarranted and what is there may be good enough.

Graham

On 13 November 2012 02:04, Jim Jagielski <j...@jagunet.com> wrote:
> Booting the discussion:
>
> http://www.jimjag.com/imo/index.php?/archives/248-The-Case-for-a-Universal-Web-Server-Load-Value.html
>
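P.S. To illustrate the front-end timestamp approach for queueing time mentioned above, here is a rough sketch (the header name and "t=<epoch seconds>" format are an assumption on my part, and clock skew between machines is simply clamped away rather than handled):

```python
import time

def queue_time_seconds(headers, now=None):
    """Derive listener/queue wait time from a front-end timestamp header.

    Assumes the front end set something like:
        X-Request-Start: t=1352700000.123456
    (seconds since the epoch). Returns 0.0 if the header is absent,
    malformed, or in the future -- the latter can happen with clock
    skew between the front-end and back-end machines.
    """
    raw = headers.get('X-Request-Start', '')
    if not raw.startswith('t='):
        return 0.0
    try:
        start = float(raw[2:])
    except ValueError:
        return 0.0
    now = time.time() if now is None else now
    return max(0.0, now - start)
```

The clamp to zero is the crude part: a proper treatment of skewed clocks would need something smarter than just discarding negative values.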