Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
> You know, I had a brief look through some of the SpeedyCGI code yesterday,
> and I think the MRU process selection might be a bit of a red herring.
> I think the real reason Speedy won the memory test is the way it spawns
> processes.

Please take a look at that code again. There's no smoke and mirrors, and no red herrings. Also, I don't look at the benchmarks as "winning" - I'm not trying to start a mod_perl vs speedy battle here. Gunther wanted to know if there were "real benchmarks", so I reluctantly put them up.

Here's how SpeedyCGI works (this is from version 2.02 of the code):

When the frontend starts, it tries to quickly grab a backend from the front of the be_wait queue, which is a LIFO. This is in speedy_frontend.c, in the get_a_backend() function.

If there aren't any idle be's, it puts itself onto the fe_wait queue. Same file, get_a_backend_hard().

If this fe (frontend) is at the front of the fe_wait queue, it "takes charge" and starts looking to see whether a backend needs to be spawned. This is part of the frontend_ping() function. It will only spawn a be if no other backends are being spawned, so only one backend gets spawned at a time.

Every frontend in the queue drops into a sigsuspend and waits for an alarm signal. The alarm is set for one second. This is also in get_a_backend_hard().

When a backend is ready to handle code, it looks at the fe_wait queue, and if there are fe's there, it sends a SIGALRM to the one at the front and sets the sent_sig flag for that fe. This is done in speedy_group.c, speedy_group_sendsigs().

When a frontend wakes on an alarm (either due to a timeout, or due to a be waking it up), it looks at its sent_sig flag to see if it can now grab a be from the queue. If so, it does. If not, it runs various checks, then goes back to sleep.

In most cases, you should get a be from the LIFO right at the beginning, in the get_a_backend() function.
Unless there aren't enough be's running, or something is killing them (bad perl code), or you've set the MaxBackends option to limit the number of be's.

> If I understand what's going on in Apache's source, once every second it
> has a look at the scoreboard and says "less than MinSpareServers are
> idle, so I'll start more" or "more than MaxSpareServers are idle, so
> I'll kill one". It only kills one per second. It starts by spawning
> one, but the number spawned goes up exponentially each time it sees
> there are still not enough idle servers, until it hits 32 per second.
> It's easy to see how this could result in spawning too many in response
> to sudden load, and then taking a long time to clear out the unnecessary
> ones.
>
> In contrast, Speedy checks on every request to see if there are enough
> backends running. If there aren't, it spawns more until there are as
> many backends as queued requests.

Speedy does not check on every request to see if there are enough backends running. In most cases, the only thing the frontend does is grab an idle backend from the LIFO. Only if there are none available does it start to worry about how many are running, etc.

> That means it never overshoots the mark.

You're correct that speedy does try not to overshoot, but mainly because there's no point in overshooting - it just wastes swap space. That's not the heart of the mechanism, though. There truly is a LIFO involved. Please read that code again, or run some tests. Speedy could overshoot by far, and the worst that would happen is that you'd get a lot of idle backends sitting in virtual memory, which the kernel would page out, and which would at some point time out and die - unless of course the load increased to where they were needed, in which case they would get used.

If you have speedy installed, you can start backends manually and test this yourself. Just run "speedy_backend script.pl &" to start a backend.
If you start lots of those on a script that says 'print "$$\n"', then run the frontend on the same script, you will still see the same pid over and over. That's the LIFO in action, reusing the same process over and over.

> Going back to your example up above, if Apache actually controlled the
> number of processes tightly enough to prevent building up idle servers,
> it wouldn't really matter much how processes were selected. If after
> the 1st and 2nd interpreters finish their run they went to the end of
> the queue instead of the beginning of it, that simply means they will
> sit idle until called for instead of some other two processes sitting
> idle until called for. If the systems were both efficient enough about
> spawning to only create as many interpreters as needed, none of them
> would be sitting idle and memory usage would always be as low as
> possible.
>
> I don't know if I'm explaining this very well, but the gist of my theory
> is that at any given time both systems will require an equal number of
> in-use interpreters to do an equal amount of work.
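The pid-reuse behaviour Sam describes can be modelled in a few lines. This is a toy sketch of the be_wait LIFO, not the actual speedy_frontend.c logic (the real code uses shared memory, sigsuspend() and SIGALRM rather than a Python list); the class and pid values are illustrative only:

```python
# Toy model of SpeedyCGI's be_wait queue: idle backends are kept in a
# LIFO, so the most recently used backend is always handed out first.

class BackendPool:
    def __init__(self):
        self.be_wait = []     # idle backends, treated as a stack (LIFO)
        self.next_pid = 100   # fake pids for illustration

    def get_a_backend(self):
        if self.be_wait:
            return self.be_wait.pop()   # front of the LIFO = last one parked
        # No idle backend available: spawn one (the real code spawns
        # only one at a time, guarded by the "taking charge" frontend)
        self.next_pid += 1
        return self.next_pid

    def backend_done(self, pid):
        self.be_wait.append(pid)        # park at the front of the LIFO

pool = BackendPool()
pids = []
for _ in range(5):                      # five sequential requests
    pid = pool.get_a_backend()
    pids.append(pid)
    pool.backend_done(pid)

print(pids)   # the same backend serves every request: [101, 101, 101, 101, 101]
```

Under concurrent load the pool grows, but sequential or lightly overlapping requests keep hitting the same backend, which is why the 'print "$$\n"' test shows one pid repeatedly.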
RE: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
There seems to be a lot of talk here, and analogies, and zero real-world benchmarking.

Now it seems to me from reading this thread that speedycgi would be better where you run one script, or only a few scripts, and mod_perl might win where you have a large application with hundreds of different URLs, with different code being executed for each. That may change with the next release of speedy, but then lots of things will change with the next major release of mod_perl too, so it's irrelevant until both are released.

As well as that, speedy still suffers (IMHO) in that it still follows the CGI scripting model, whereas mod_perl offers a much more flexible environment and a feature-rich API (the Apache API). What's more, I could never build something like AxKit in speedycgi without resorting to hacks like mod_rewrite to hide nasty URLs. At least that's my conclusion from first appearances.

Either way, both solutions have their merits. Neither is going to totally replace the other.

What I'd really like to do, though, is sum up this thread in a short article for take23. I'll see if I have time on Sunday to do it.

--
** Director and CTO ** AxKit.com Ltd ** XML Application Serving **
** http://axkit.org ** XSLT, XPathScript, XSP **
** Personal Web Site: http://sergeant.org/ **
RE: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
> > This doesn't affect the argument, because the core of it is that:
> >
> > a) the CPU will not completely process a single task all at once;
> >    instead, it will divide its time _between_ the tasks
> > b) tasks do not arrive at regular intervals
> > c) tasks take varying amounts of time to complete

[snip]

> I won't agree with (a) unless you qualify it further - what do you claim
> is the method or policy for (a)?

I think this has been answered... basically, resource conflicts (including I/O), interrupts, long-running tasks, higher-priority tasks and, of course, the process yielding can all cause the CPU to switch processes (which of these qualify depends very much on the OS in question). This is why, despite the efficiency of single-task running, you can usefully run more than one process on a UNIX system. Otherwise, if you ran a single Apache process and had no traffic, you couldn't run a shell at the same time - Apache would consume practically all your CPU in its select() loop 8-)

> Apache httpd's are scheduled on an LRU basis. This was discussed early
> in this thread. Apache uses a file-lock for its mutex around the accept
> call, and file-locking is implemented in the kernel using a round-robin
> (fair) selection in order to prevent starvation. This results in
> incoming requests being assigned to httpd's in an LRU fashion.

I'll apologise and say, yes, of course you're right, but I do have a query: there are (IIRC) five methods that Apache uses to serialize requests - fcntl(), flock(), SysV semaphores, uslock (IRIX only) and Pthreads (reliably only on Solaris). Do they _all_ result in LRU?

> Remember that the httpd's in the speedycgi case will have very little
> un-shared memory, because they don't have perl interpreters in them.
> So the processes are fairly indistinguishable, and the LRU isn't as
> big a penalty in that case.
Yes - _but_, interpreter for interpreter, won't the equivalent speedycgi have roughly as much unshared memory as the mod_perl? I've had a lot of (dumb) discussions with people who complain about the size of Apache+mod_perl without realising that the interpreter code is all shared, and, with pre-loading, a lot of the perl code can be too. While I _can_ see speedycgi having an advantage (because it's got a much better overview of what's happening, and can intelligently manage the situation), I don't think it's as large as you're suggesting. I think this needs to be intensively benchmarked to answer that.

> other interpreters, and you expand the number of interpreters in use.
> But still, you'll wind up using the smallest number of interpreters
> required for the given load and timeslice. As soon as those 1st and
> 2nd perl interpreters finish their run, they go back at the beginning
> of the queue, and the 7th/8th or later requests can then use them, etc.
> Now you have a pool of maybe four interpreters, all being used on an MRU
> basis. But it won't expand beyond that set unless your load goes up or
> your program's CPU time requirements increase beyond another timeslice.
> MRU will ensure that whatever the number of interpreters in use, it
> is the lowest possible, given the load, the CPU-time required by the
> program and the size of the timeslice.

Yep... no arguments here. SpeedyCGI should result in fewer interpreters. I will say that there are a lot of convincing reasons to follow the SpeedyCGI model rather than the mod_perl model, but I've generally thought that the increase in performance that can be obtained is sufficiently minimal as to not warrant the extra layer... thoughts, anyone?

Stephen.
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
> > There's only one run queue in the kernel. The first task ready to run
> > is put at the head of that queue, and anything arriving afterwards
> > waits. Only if that first task blocks on a resource or takes a very
> > long time, or a higher priority process becomes able to run due to an
> > interrupt, is that process taken out of the queue.
>
> Note that any I/O request that isn't completely handled by buffers will
> trigger the 'blocks on a resource' clause above, which means that
> jobs doing any real work will complete in an order determined by
> something other than the cpu, and not strictly serialized. Also, most
> of my web servers are dual-cpu, so even cpu-bound processes may
> complete out of order.

I think it's much easier to visualize how MRU helps when you look at one thing running at a time. And MRU works best when every process runs to completion instead of blocking, etc. But even if the process gets timesliced, blocked, etc., MRU still degrades gracefully. You'll get more processes in use, but the numbers will still remain small.

> > > Similarly, because of the non-deterministic nature of computer
> > > systems, Apache doesn't service requests on an LRU basis; you're
> > > comparing SpeedyCGI against a straw man. Apache's servicing
> > > algorithm approaches randomness, so you need to build a comparison
> > > between forced-MRU and random choice.
> >
> > Apache httpd's are scheduled on an LRU basis. This was discussed early
> > in this thread. Apache uses a file-lock for its mutex around the accept
> > call, and file-locking is implemented in the kernel using a round-robin
> > (fair) selection in order to prevent starvation. This results in
> > incoming requests being assigned to httpd's in an LRU fashion.
>
> But, if you are running a front/back end apache with a small number
> of spare servers configured on the back end, there really won't be
> any idle perl processes during the busy times you care about.
> That is, the backends will all be running or apache will shut them down,
> and there won't be any difference between MRU and LRU (the difference
> would be which idle process waits longer - if none are idle there is
> no difference).

If you can tune it just right so you never run out of ram, then I think you could get the same performance as MRU on something like hello-world.

> > Once the httpd's get into the kernel's run queue, they finish in the
> > same order they were put there, unless they block on a resource, get
> > timesliced or are pre-empted by a higher priority process.
>
> Which means they don't finish in the same order if (a) you have
> more than one cpu, (b) they do any I/O (including delivering the
> output back, which they all do), or (c) some of them run long enough
> to consume a timeslice.
>
> > Try it and see. I'm sure you'll run more processes with speedycgi, but
> > you'll probably run a whole lot fewer perl interpreters and need less ram.
>
> Do you have a benchmark that does some real work (at least a dbm
> lookup) to compare against a front/back end mod_perl setup?

No, but if you send me one, I'll run it.
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Sam Horrocks wrote:
> say they take two slices, and interpreters 1 and 2 get pre-empted and
> go back into the queue. So then requests 5/6 in the queue have to use
> other interpreters, and you expand the number of interpreters in use.
> But still, you'll wind up using the smallest number of interpreters
> required for the given load and timeslice. As soon as those 1st and
> 2nd perl interpreters finish their run, they go back at the beginning
> of the queue, and the 7th/8th or later requests can then use them, etc.
> Now you have a pool of maybe four interpreters, all being used on an MRU
> basis. But it won't expand beyond that set unless your load goes up or
> your program's CPU time requirements increase beyond another timeslice.
> MRU will ensure that whatever the number of interpreters in use, it
> is the lowest possible, given the load, the CPU-time required by the
> program and the size of the timeslice.

You know, I had a brief look through some of the SpeedyCGI code yesterday, and I think the MRU process selection might be a bit of a red herring. I think the real reason Speedy won the memory test is the way it spawns processes.

If I understand what's going on in Apache's source, once every second it has a look at the scoreboard and says "less than MinSpareServers are idle, so I'll start more" or "more than MaxSpareServers are idle, so I'll kill one". It only kills one per second. It starts by spawning one, but the number spawned goes up exponentially each time it sees there are still not enough idle servers, until it hits 32 per second. It's easy to see how this could result in spawning too many in response to sudden load, and then taking a long time to clear out the unnecessary ones.

In contrast, Speedy checks on every request to see if there are enough backends running. If there aren't, it spawns more until there are as many backends as queued requests. That means it never overshoots the mark.
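The scoreboard policy described above (spawn batches that double each second up to 32, kill only one idle child per second) can be sketched as a toy model. This is an illustration of the policy as stated in the thread, not Apache's actual maintenance code; the function name and values are assumptions for the sketch:

```python
# Toy model of the Apache 1.3-era parent-loop policy described above:
# each second, if idle children < MinSpareServers, spawn a batch that
# doubles every consecutive busy second (1, 2, 4, ... capped at 32);
# if idle children > MaxSpareServers, kill exactly one.

def maintenance_tick(idle, min_spare, max_spare, spawn_burst):
    """Return (children_spawned, children_killed, next_spawn_burst)."""
    if idle < min_spare:
        return spawn_burst, 0, min(spawn_burst * 2, 32)  # exponential ramp-up
    if idle > max_spare:
        return 0, 1, 1      # kill only one per second; reset the ramp
    return 0, 0, 1

# Sudden load: no idle children for six consecutive seconds.
burst, total = 1, 0
for second in range(6):
    spawned, killed, burst = maintenance_tick(0, 5, 10, burst)
    total += spawned

print(total)   # 1 + 2 + 4 + 8 + 16 + 32 = 63 children in six seconds
```

Clearing those 63 surplus children at one kill per second then takes about a minute, which is the "taking a long time to clear out the unnecessary ones" effect.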
Going back to your example up above, if Apache actually controlled the number of processes tightly enough to prevent building up idle servers, it wouldn't really matter much how processes were selected. If, after the 1st and 2nd interpreters finish their run, they went to the end of the queue instead of the beginning of it, that simply means they will sit idle until called for, instead of some other two processes sitting idle until called for. If both systems were efficient enough about spawning to create only as many interpreters as needed, none of them would be sitting idle and memory usage would always be as low as possible.

I don't know if I'm explaining this very well, but the gist of my theory is that at any given time both systems will require an equal number of in-use interpreters to do an equal amount of work, and the differentiator between the two is Apache's relatively poor estimate of how many processes should be available at any given time. I think this theory matches up nicely with the results of Sam's tests: when MaxClients prevents Apache from spawning too many processes, both systems have similar performance characteristics.

There are some knobs to twiddle in Apache's source if anyone is interested in playing with it. You can change the frequency of the checks and the maximum number of servers spawned per check. I don't have much motivation to do this investigation myself, since I've already tuned our MaxClients and process size constraints to prevent problems with our application.

- Perrin
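For reference, the config-level knobs involved in the tuning Perrin mentions look like this in an Apache 1.3 httpd.conf. The values here are purely illustrative, not recommendations from the thread (the per-second check frequency and 32-per-second spawn cap are compile-time, in the source, not configurable here):

```apache
# Apache 1.3 prefork tuning relevant to the discussion above.
# Illustrative values only.
MinSpareServers   5    # below this many idle children, start spawning more
MaxSpareServers  10    # above this many idle children, kill one per second
StartServers      5    # children created at startup
MaxClients       30    # hard cap; per Perrin, tuning this prevents the
                       # over-spawning pathology in the benchmarks
```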
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
----- Original Message -----
From: "Sam Horrocks" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "mod_perl list" <[EMAIL PROTECTED]>; "Stephen Anderson" <[EMAIL PROTECTED]>
Sent: Thursday, January 18, 2001 10:38 PM
Subject: Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

> There's only one run queue in the kernel. The first task ready to run is
> put at the head of that queue, and anything arriving afterwards waits.
> Only if that first task blocks on a resource or takes a very long time,
> or a higher priority process becomes able to run due to an interrupt,
> is that process taken out of the queue.

Note that any I/O request that isn't completely handled by buffers will trigger the 'blocks on a resource' clause above, which means that jobs doing any real work will complete in an order determined by something other than the cpu, and not strictly serialized. Also, most of my web servers are dual-cpu, so even cpu-bound processes may complete out of order.

> > Similarly, because of the non-deterministic nature of computer systems,
> > Apache doesn't service requests on an LRU basis; you're comparing
> > SpeedyCGI against a straw man. Apache's servicing algorithm approaches
> > randomness, so you need to build a comparison between forced-MRU and
> > random choice.
>
> Apache httpd's are scheduled on an LRU basis. This was discussed early
> in this thread. Apache uses a file-lock for its mutex around the accept
> call, and file-locking is implemented in the kernel using a round-robin
> (fair) selection in order to prevent starvation. This results in
> incoming requests being assigned to httpd's in an LRU fashion.

But, if you are running a front/back end apache with a small number of spare servers configured on the back end, there really won't be any idle perl processes during the busy times you care about.
That is, the backends will all be running or apache will shut them down, and there won't be any difference between MRU and LRU (the difference would be which idle process waits longer - if none are idle there is no difference).

> Once the httpd's get into the kernel's run queue, they finish in the
> same order they were put there, unless they block on a resource, get
> timesliced or are pre-empted by a higher priority process.

Which means they don't finish in the same order if (a) you have more than one cpu, (b) they do any I/O (including delivering the output back, which they all do), or (c) some of them run long enough to consume a timeslice.

> Try it and see. I'm sure you'll run more processes with speedycgi, but
> you'll probably run a whole lot fewer perl interpreters and need less ram.

Do you have a benchmark that does some real work (at least a dbm lookup) to compare against a front/back end mod_perl setup?

> Remember that the httpd's in the speedycgi case will have very little
> un-shared memory, because they don't have perl interpreters in them.
> So the processes are fairly indistinguishable, and the LRU isn't as
> big a penalty in that case.
>
> This is why the original designers of Apache thought it was safe to
> create so many httpd's. If they all have the same (shared) memory,
> then creating a lot of them does not have much of a penalty. mod_perl
> applications throw a big monkey wrench into this design when they add
> a lot of unshared memory to the httpd's.

This is part of the reason the front/back end mod_perl configuration works well, keeping the backend numbers low. The real win when serving over the internet, though, is that the perl memory is no longer tied up while delivering the output back over frequently slow connections.

Les Mikesell
[EMAIL PROTECTED]
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
> This doesn't affect the argument, because the core of it is that:
>
> a) the CPU will not completely process a single task all at once;
>    instead, it will divide its time _between_ the tasks
> b) tasks do not arrive at regular intervals
> c) tasks take varying amounts of time to complete
>
> Now, if (a) were true but (b) and (c) were not, then, yes, it would have
> the same effective result as sequential processing. Tasks that arrived
> first would finish first. In the real world however, (b) and (c) are
> usually true, and it becomes practically impossible to predict which task
> handler (in this case, a mod_perl process) will complete first.

I'll agree with (b) and (c) - I ignored them to keep my analogy as simple as possible. Again, the goal of my analogy was to show that a stream of 10 concurrent requests can be handled with the same throughput by a lot fewer than 10 perl interpreters. (b) and (c) don't really have an effect on that - they don't control the order in which processes arrive and get queued up for the CPU.

I won't agree with (a) unless you qualify it further - what do you claim is the method or policy for (a)? There's only one run queue in the kernel. The first task ready to run is put at the head of that queue, and anything arriving afterwards waits. Only if that first task blocks on a resource or takes a very long time, or a higher priority process becomes able to run due to an interrupt, is that process taken out of the queue.

It is inefficient for the unix kernel to be constantly switching very quickly from process to process, because it takes time to do context switches. Also, unless the processes share the same memory, some amount of the processor cache can get flushed when you switch processes, because you're changing to a different set of memory pages. That's why it's best for overall throughput if the kernel keeps a single process running as long as it can.
> Similarly, because of the non-deterministic nature of computer systems,
> Apache doesn't service requests on an LRU basis; you're comparing
> SpeedyCGI against a straw man. Apache's servicing algorithm approaches
> randomness, so you need to build a comparison between forced-MRU and
> random choice.

Apache httpd's are scheduled on an LRU basis. This was discussed early in this thread. Apache uses a file-lock for its mutex around the accept call, and file-locking is implemented in the kernel using a round-robin (fair) selection in order to prevent starvation. This results in incoming requests being assigned to httpd's in an LRU fashion. Once the httpd's get into the kernel's run queue, they finish in the same order they were put there, unless they block on a resource, get timesliced or are pre-empted by a higher priority process.

> Thinking about it, assuming you are, at some time, servicing requests
> _below_ system capacity, SpeedyCGI will always win in memory usage, and
> probably have an edge in handling response time. My concern would be,
> does it offer _enough_ of an edge? Especially bearing in mind, if I
> understand, you could end up running anywhere up to 2x as many processes
> (n Apache handlers + n script handlers)?

Try it and see. I'm sure you'll run more processes with speedycgi, but you'll probably run a whole lot fewer perl interpreters and need less ram.

Remember that the httpd's in the speedycgi case will have very little un-shared memory, because they don't have perl interpreters in them. So the processes are fairly indistinguishable, and the LRU isn't as big a penalty in that case.

This is why the original designers of Apache thought it was safe to create so many httpd's. If they all have the same (shared) memory, then creating a lot of them does not have much of a penalty. mod_perl applications throw a big monkey wrench into this design when they add a lot of unshared memory to the httpd's.
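The LRU-vs-MRU distinction at the heart of this exchange shows up clearly in a toy model (a hypothetical simulation, not either server's code): with ten idle workers and strictly sequential requests, an MRU (LIFO) hand-out keeps reusing one worker, while an LRU (FIFO) hand-out cycles through all ten - so under mod_perl every child eventually accumulates an interpreter's worth of unshared memory.

```python
from collections import deque

def serve(idle, requests, mru):
    """Hand each sequential request an idle worker; MRU takes from the
    hot end of the pool, LRU from the cold end. Each worker is parked
    back at the hot end when its request finishes."""
    pool = deque(idle)
    touched = set()
    for _ in range(requests):
        worker = pool.pop() if mru else pool.popleft()
        touched.add(worker)
        pool.append(worker)      # finished worker parks at the hot end
    return touched

workers = range(10)
print(len(serve(workers, 100, mru=True)))    # 1  - MRU keeps reusing one worker
print(len(serve(workers, 100, mru=False)))   # 10 - LRU round-robins through all
```

With genuinely concurrent requests the MRU set grows to match the concurrency, which is the "smallest number of interpreters required for the given load" behaviour argued for elsewhere in the thread.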
> > No, homogeneity (or the lack of it) wouldn't make a difference. Those
> > 3rd, 5th or 6th processes run only *after* the 1st and 2nd have
> > finished using the CPU. And at that point you could re-use those
> > interpreters that 1 and 2 were using.
>
> This, if you'll excuse me, is quite clearly wrong. See the above
> argument, and imagine that tasks 1 and 2 happen to take three times as
> long to complete as 3, and you should see that they could all end up
> being in the scheduling queue together. Perhaps you're considering tasks
> which are too small to take more than 1 or 2 timeslices, in which case
> you're much less likely to want to accelerate them.

So far, to keep things fairly simple, I've assumed you take less than one timeslice to run. A timeslice is fairly long on a linux pc (210ms). But say they take two slices, and interpreters 1 and 2 get pre-empted and go back into the queue. So then requests 5/6 in the queue have to use other interpreters, and you expand the number of interpreters in use.