Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
There's only one run queue in the kernel. The first task ready to run is put at the head of that queue, and anything arriving afterwards waits. Only if that first task blocks on a resource or takes a very long time, or a higher-priority process becomes able to run due to an interrupt, is that process taken out of the queue.

Note that any I/O request that isn't completely handled by buffers will trigger the 'blocks on a resource' clause above, which means that jobs doing any real work will complete in an order determined by something other than the CPU, and not strictly serialized. Also, most of my web servers are dual-CPU, so even CPU-bound processes may complete out of order.

I think it's much easier to visualize how MRU helps when you look at one thing running at a time. And MRU works best when every process runs to completion instead of blocking, etc. But even if the process gets timesliced, blocked, etc., MRU still degrades gracefully. You'll get more processes in use, but the numbers will still remain small.

Similarly, because of the non-deterministic nature of computer systems, Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI against a straw man. Apache's servicing algorithm approaches randomness, so you need to build a comparison between forced-MRU and random choice.

Apache httpd's are scheduled on an LRU basis. This was discussed early in this thread. Apache uses a file-lock for its mutex around the accept call, and file-locking is implemented in the kernel using a round-robin (fair) selection in order to prevent starvation. This results in incoming requests being assigned to httpd's in an LRU fashion.

But, if you are running a front/back end apache with a small number of spare servers configured on the back end, there really won't be any idle perl processes during the busy times you care about.
That is, the backends will all be running or apache will shut them down, and there won't be any difference between MRU and LRU (the difference would be which idle process waits longer - if none are idle there is no difference). If you can tune it just right so you never run out of RAM, then I think you could get the same performance as MRU on something like hello-world.

Once the httpd's get into the kernel's run queue, they finish in the same order they were put there, unless they block on a resource, get timesliced or are pre-empted by a higher priority process.

Which means they don't finish in the same order if (a) you have more than one CPU, (b) they do any I/O (including delivering the output back, which they all do), or (c) some of them run long enough to consume a timeslice.

Try it and see. I'm sure you'll run more processes with speedycgi, but you'll probably run a whole lot fewer perl interpreters and need less RAM.

Do you have a benchmark that does some real work (at least a dbm lookup) to compare against a front/back end mod_perl setup?

No, but if you send me one, I'll run it.
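The MRU-versus-LRU distinction being argued in this thread can be seen in a toy simulation. This is only an illustrative sketch, not SpeedyCGI's or Apache's actual code; the function and its parameters are invented. It models the simple case the thread keeps coming back to: one CPU, each request finishing before the next is assigned, so a worker is always free again by the time it's needed:

```python
from collections import deque

def simulate(policy, n_workers, n_requests):
    """Count how many distinct workers ever run a request.
    policy='mru' takes the most recently freed worker (LIFO, like
    SpeedyCGI's be_wait queue); policy='lru' takes the least
    recently freed (FIFO, like the kernel's fair round-robin
    wakeup on the file-lock mutex around accept())."""
    idle = deque(range(n_workers))   # all workers start idle
    used = set()
    for _ in range(n_requests):
        w = idle.pop() if policy == 'mru' else idle.popleft()
        used.add(w)
        # single CPU, request finishes before the next is assigned:
        # the worker goes back on the idle list
        idle.append(w)
    return len(used)
```

Under LRU every worker gets touched and stays warm in memory; under MRU one worker absorbs the whole stream and the rest can be paged out or reaped. `simulate('mru', 10, 100)` uses 1 worker, while `simulate('lru', 10, 100)` cycles through all 10.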
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
You know, I had a brief look through some of the SpeedyCGI code yesterday, and I think the MRU process selection might be a bit of a red herring. I think the real reason Speedy won the memory test is the way it spawns processes.

Please take a look at that code again. There's no smoke and mirrors, no red herrings. Also, I don't look at the benchmarks as "winning" - I am not trying to start a mod_perl vs speedy battle here. Gunther wanted to know if there were "real benchmarks", so I reluctantly put them up.

Here's how SpeedyCGI works (this is from version 2.02 of the code):

When the frontend starts, it tries to quickly grab a backend from the front of the be_wait queue, which is a LIFO. This is in speedy_frontend.c, in the get_a_backend() function.

If there aren't any idle be's, it puts itself onto the fe_wait queue. Same file, get_a_backend_hard().

If this fe (frontend) is at the front of the fe_wait queue, it "takes charge" and starts looking to see if a backend needs to be spawned. This is part of the frontend_ping() function. It will only spawn a be if no other backends are being spawned, so only one backend gets spawned at a time.

Every frontend in the queue drops into a sigsuspend and waits for an alarm signal. The alarm is set for 1 second. This is also in get_a_backend_hard().

When a backend is ready to handle code, it goes and looks at the fe_wait queue, and if there are fe's there, it sends a SIGALRM to the one at the front and sets the sent_sig flag for that fe. This is done in speedy_group.c, speedy_group_sendsigs().

When a frontend wakes on an alarm (either due to a timeout, or due to a be waking it up), it looks at its sent_sig flag to see if it can now grab a be from the queue. If so, it does that. If not, it runs various checks, then goes back to sleep.

In most cases, you should get a be from the LIFO right at the beginning, in the get_a_backend() function.
Unless there aren't enough be's running, or something is killing them (bad perl code), or you've set the MaxBackends option to limit the number of be's.

If I understand what's going on in Apache's source, once every second it has a look at the scoreboard and says "less than MinSpareServers are idle, so I'll start more" or "more than MaxSpareServers are idle, so I'll kill one". It only kills one per second. It starts by spawning one, but the number spawned goes up exponentially each time it sees there are still not enough idle servers, until it hits 32 per second. It's easy to see how this could result in spawning too many in response to sudden load, and then taking a long time to clear out the unnecessary ones. In contrast, Speedy checks on every request to see if there are enough backends running. If there aren't, it spawns more until there are as many backends as queued requests.

Speedy does not check on every request to see if there are enough backends running. In most cases, the only thing the frontend does is grab an idle backend from the LIFO. Only if there are none available does it start to worry about how many are running, etc.

That means it never overshoots the mark.

You're correct that speedy does try not to overshoot, but mainly because there's no point in overshooting - it just wastes swap space. But that's not the heart of the mechanism. There truly is a LIFO involved. Please read that code again, or run some tests. Speedy could overshoot by far, and the worst that would happen is that you would get a lot of idle backends sitting in virtual memory, which the kernel would page out, and then at some point they'll time out and die. Unless of course the load increases to a point where they're needed, in which case they would get used.

If you have speedy installed, you can manually start backends yourself and test. Just run "speedy_backend script.pl" to start a backend.
If you start lots of those on a script that says 'print "$$\n"', then run the frontend on the same script, you will still see the same pid over and over. This is the LIFO in action, reusing the same process over and over.

Going back to your example up above, if Apache actually controlled the number of processes tightly enough to prevent building up idle servers, it wouldn't really matter much how processes were selected. If after the 1st and 2nd interpreters finish their run they went to the end of the queue instead of the beginning of it, that simply means they will sit idle until called for, instead of some other two processes sitting idle until called for. If the systems were both efficient enough about spawning to only create as many interpreters as needed, none of them would be sitting idle and memory usage would always be as low as possible. I don't know if I'm explaining this very well, but the gist of my theory is that at any given time both
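The Apache 1.3 spawning behaviour described earlier in this exchange (start with one, double each second there is still an idle-server shortage, cap at 32 per second) can be sketched as follows. This is a simplification built from the description in this thread, not Apache's actual source:

```python
def apache_spawns_per_second(seconds_short_of_spare):
    """Sketch of Apache 1.3's once-per-second idle-server
    maintenance, as described in this thread: each second it
    still sees fewer than MinSpareServers idle, it doubles the
    number of children it spawns, capped at 32 per second."""
    spawned = []
    rate = 1
    for _ in range(seconds_short_of_spare):
        spawned.append(rate)
        rate = min(rate * 2, 32)   # exponential ramp, 32/sec ceiling
    return spawned
```

Seven seconds of sustained shortage yields [1, 2, 4, 8, 16, 32, 32] spawns, 95 children in total, which is how a sudden load spike can overshoot and then take a long time to clear, since only one child is killed per second.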
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
, each customer puts in their order immediately, then waits 50 minutes for it to arrive. In the second scenario each customer waits 40 minutes to put in their order, then waits another 10 minutes for it to arrive. What I'm trying to show with this analogy is that no matter how many "simultaneous" requests you have, they all have to be serialized at some point, because you only have one CPU. Either you can serialize them before they get to the perl interpreter, or afterward. Either way you wait on the CPU, and you get the same throughput. Does that help?

I have just gotten around to reading this thread I've been saving for a rainy day. Well, it's not rainy, but I'm finally getting to it. Apologies to those who hate when people don't snip their reply mails, but I am including it so that the entire context is not lost.

Sam (or others who may understand Sam's explanation), I am still confused by this explanation of MRU helping when there are 10 processes serving 10 requests at all times. I understand MRU helping when the processes are not at max, but I don't see how it helps when they are at max utilization. It seems to me that if the wait is the same for mod_perl backend processes and speedycgi processes, it doesn't matter if some of the speedycgi processes cycle earlier than the mod_perl ones, because all 10 will always be used. I did read and reread (once) the snippets about modeling concurrency and the HTTP waiting for an accept. But I still don't understand how MRU helps when all the processes would be in use anyway. At that point they all have an equal chance of being called. Could you clarify this with a simpler example? Maybe 4 processes and a sample timeline of what happens to those when there are enough requests to keep all 4 busy all the time, for speedyCGI and a mod_perl backend?

At 04:32 AM 1/6/01 -0800, Sam Horrocks wrote:

Let me just try to explain my reasoning. I'll define a couple of my base assumptions, in case you disagree with them.
- Slices of CPU time doled out by the kernel are very small - so small that processes can be considered concurrent, even though technically they are handled serially.

Don't agree. You're equating the model with the implementation. Unix processes model concurrency, but when it comes down to it, if you don't have more CPUs than processes, you can only simulate concurrency. Each process runs until it either blocks on a resource (timer, network, disk, pipe to another process, etc), or a higher priority process pre-empts it, or it's taken so much time that the kernel wants to give another process a chance to run.

- A set of requests can be considered "simultaneous" if they all arrive and start being handled in a period of time shorter than the time it takes to service a request.

That sounds OK.

Operating on these two assumptions, I say that 10 simultaneous requests will require 10 interpreters to service them. There's no way to handle them with fewer, unless you queue up some of the requests and make them wait.

Right. And that waiting takes place:

- In the mutex around the accept call in the httpd
- In the kernel's run queue when the process is ready to run, but is waiting for other processes ahead of it.

So, since there is only one CPU, then in both cases (mod_perl and SpeedyCGI), processes spend time waiting. But what happens in the case of SpeedyCGI is that while some of the httpd's are waiting, one of the earlier speedycgi perl interpreters has already finished its run through the perl code and has put itself back at the front of the speedycgi queue. And by the time that Nth httpd gets around to running, it can re-use that first perl interpreter instead of needing yet another process. This is why it's important that you don't assume that Unix is truly concurrent.
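The argument above - that on one CPU the Nth request can reuse an interpreter freed by an earlier one - can be sketched numerically. This is a toy model under the text's own assumptions (pure CPU work, no preemption); the function and its parameters are invented for illustration:

```python
def interpreters_needed(n_requests, arrival_gap, cpu_time, deliver_time=0.0):
    """Toy model: requests are serialized on one CPU. An
    interpreter is claimed only when its request gets the CPU,
    and stays busy for deliver_time afterwards (handing results
    back to the client). A new interpreter is counted only when
    no already-finished one can be reused. No I/O blocking or
    timeslicing is modeled; times are abstract units."""
    busy_until = []          # per-interpreter finish times
    cpu_free = 0.0
    for i in range(n_requests):
        start = max(i * arrival_gap, cpu_free)
        for j, t in enumerate(busy_until):
            if t <= start:   # reuse an interpreter that has freed up
                busy_until[j] = start + cpu_time + deliver_time
                break
        else:                # none free: spawn another
            busy_until.append(start + cpu_time + deliver_time)
        cpu_free = start + cpu_time
    return len(busy_until)
```

With pure CPU work, one interpreter handles all 10 "simultaneous" requests in this idealized model, because each request only claims an interpreter when it actually gets the CPU. Only when interpreters stay busy beyond their CPU slice (delivering output, blocking on I/O) does the count grow - which matches the caveats about I/O and timeslicing raised elsewhere in the thread.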
I also say that if you have a top limit of 10 interpreters on your machine because of memory constraints, and you're sending in 10 simultaneous requests constantly, all interpreters will be used all the time. In that case it makes no difference to the throughput whether you use MRU or LRU.

This is not true for SpeedyCGI, because of the reason I give above. 10 simultaneous requests will not necessarily require 10 interpreters. What you say would be true if you had 10 processors and could get true concurrency. But on single-cpu systems you usually don't need 10 unix processes to handle 10 requests concurrently, since they get serialized by the kernel anyways.

I think the CPU slices are smaller than that. I don't know much about process scheduling, so I could be wrong. I would agree with you if we were ta
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
ab -t 30 -c 300 http://localhost/$which/hello_world

Before running each test, I rebooted my system. Here's the software installed:

angel: {139}# rpm -q -a | egrep -i 'mod_perl|speedy|apache'
apache-1.3.9-4
speedycgi-2.02-1
apache-devel-1.3.9-4
speedycgi-apache-2.02-1
mod_perl-1.21-2

Here are some relevant parameters from my httpd.conf:

MinSpareServers 8
MaxSpareServers 20
StartServers 10
MaxClients 150
MaxRequestsPerChild 1
SpeedyMaxRuns 0

At 03:19 AM 1/17/01 -0800, Sam Horrocks wrote:

I think the major problem is that you're assuming that just because there are 10 constant concurrent requests, there have to be 10 perl processes serving those requests at all times in order to get maximum throughput. The problem with that assumption is that there is only one CPU - ten processes cannot all run simultaneously anyways, so you don't really need ten perl interpreters.

I've been trying to think of better ways to explain this. I'll try to explain with an analogy - it's sort-of lame, but maybe it'll give you a mental picture of what's happening. To eliminate some confusion, this analogy doesn't address LRU/MRU, nor waiting on other events like network or disk i/o. It only tries to explain why you don't necessarily need 10 perl-interpreters to handle a stream of 10 concurrent requests on a single-CPU system.

You own a fast-food restaurant. The players involved are:

Your customers. These represent the http requests.
Your cashiers. These represent the perl interpreters.
Your cook. You only have one. This represents your CPU.

The normal flow of events is this: A cashier gets an order from a customer. The cashier goes and waits until the cook is free, and then gives the order to the cook. The cook then cooks the meal, taking 5 minutes for each meal. The cashier waits for the meal to be ready, then takes the meal and gives it to the customer. The cashier then serves another customer. The cashier/customer interaction takes a very small amount of time.
The analogy is this: An http request (customer) arrives. It is given to a perl interpreter (cashier). A perl interpreter must wait for all other perl interpreters ahead of it to finish using the CPU (the cook). It can't serve any other requests until it finishes this one. When its turn arrives, the perl interpreter uses the CPU to process the perl code. It then finishes and gives the results over to the http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store. At each 5-minute interval thereafter another customer arrives. So at time 0, there is a pool of 10 customers. At time +5, another customer arrives. At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load. What would happen is that the 10 cashiers would fairly quickly get all the orders from the first 10 customers simultaneously, and then start waiting for the cook. The 10 cashiers would queue up. Cashier #1 would put in the first order. Cashiers 2-10 would wait their turn. After 5 minutes, cashier #1 would receive the meal, deliver it to customer #1, and then serve the next customer (#11) that just arrived at the 5-minute mark. Cashier #1 would take customer #11's order, then queue up and wait in line for the cook - there will be 9 other cashiers already in line, so the wait will be long. At the 10-minute mark, cashier #2 would receive a meal from the cook, deliver it to customer #2, then go on and serve the next customer (#12) that just arrived. Cashier #2 would then go and wait in line for the cook. This continues on through all the cashiers in order 1-10, then repeating, 1-10, ad infinitum.

Now even though you have 10 cashiers, most of their time is spent waiting to put in an order to the cook. Starting with customer #11, all customers will wait 50 minutes for their meal.
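The 50-minute figure can be checked with a small deterministic model of the analogy as stated: one cook, 5-minute meals served in order, 10 customers at opening time, one more arriving every 5 minutes. This sketch assumes at least two cashiers, so orders reach the cook in customer order and meal n finishes at n times the cook time; the function name and parameters are invented:

```python
def wait_times(n_customers, cook_time=5, initial=10):
    """Minutes each customer waits for their meal in the
    restaurant analogy. The cook (the CPU) is the bottleneck:
    meal n is finished at n * cook_time. Customer n <= initial
    arrives at time 0; after that, one customer arrives every
    cook_time minutes. Assumes enough cashiers (>= 2) that
    order-taking never delays the cook."""
    waits = []
    for n in range(1, n_customers + 1):
        arrival = 0 if n <= initial else (n - initial) * cook_time
        waits.append(n * cook_time - arrival)
    return waits
```

Every customer from #11 on waits exactly 50 minutes, and notably the function never mentions the number of cashiers: once at least two keep orders flowing to the cook in arrival order, throughput and wait times are fixed by the cook alone, which is the point of the analogy.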
When customer #11 comes in he/she will immediately get to place an order, but it will take the cashier 45 minutes to wait for the cook to become free, and another 5 minutes for the meal to be cooked. The same is true for customer #12, and all customers from then on.

Now, the question is, could you get the same throughput with fewer cashiers? Say you had 2 cashiers instead. The 10 customers are there waiting. The 2 cashiers take orders from customers #1 and #2. Cashier #1 then gives the order to the cook and waits. Cashier #2 waits in line for the cook behind cashier #1. At the 5-minute mark, the first meal is done. Cashier #1 delivers the meal to customer #1, then serves customer #3. Cashier #1 then goes and stands in line behind cashier #2. At the 10-minute mark, cashier #2's meal is ready - it's delivered to customer
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
There is no coffee. Only meals. No substitutions. :-) If we added coffee to the menu it would still have to be prepared by the cook. Remember that you only have one CPU, and all the perl interpreters, large and small, must gain access to that CPU in order to run.

Sam

I have a wide assortment of queries on a site, some of which take several minutes to execute, while others execute in less than one second. If I understand this analogy correctly, I'd be better off with the current incarnation of mod_perl, because there would be more cashiers around to serve the "quick cups of coffee" that many customers request at my diner. Is this correct?

Sam Horrocks wrote:

I think the major problem is that you're assuming that just because there are 10 constant concurrent requests, there have to be 10 perl processes serving those requests at all times in order to get maximum throughput. The problem with that assumption is that there is only one CPU - ten processes cannot all run simultaneously anyways, so you don't really need ten perl interpreters.

I've been trying to think of better ways to explain this. I'll try to explain with an analogy - it's sort-of lame, but maybe it'll give you a mental picture of what's happening. To eliminate some confusion, this analogy doesn't address LRU/MRU, nor waiting on other events like network or disk i/o. It only tries to explain why you don't necessarily need 10 perl-interpreters to handle a stream of 10 concurrent requests on a single-CPU system.

You own a fast-food restaurant. The players involved are:

Your customers. These represent the http requests.
Your cashiers. These represent the perl interpreters.
Your cook. You only have one. This represents your CPU.

The normal flow of events is this: A cashier gets an order from a customer. The cashier goes and waits until the cook is free, and then gives the order to the cook. The cook then cooks the meal, taking 5 minutes for each meal.
The cashier waits for the meal to be ready, then takes the meal and gives it to the customer. The cashier then serves another customer. The cashier/customer interaction takes a very small amount of time.

The analogy is this: An http request (customer) arrives. It is given to a perl interpreter (cashier). A perl interpreter must wait for all other perl interpreters ahead of it to finish using the CPU (the cook). It can't serve any other requests until it finishes this one. When its turn arrives, the perl interpreter uses the CPU to process the perl code. It then finishes and gives the results over to the http client (the customer).

Now, say in this analogy you begin the day with 10 customers in the store. At each 5-minute interval thereafter another customer arrives. So at time 0, there is a pool of 10 customers. At time +5, another customer arrives. At time +10, another customer arrives, ad infinitum.

You could hire 10 cashiers in order to handle this load. What would happen is that the 10 cashiers would fairly quickly get all the orders from the first 10 customers simultaneously, and then start waiting for the cook. The 10 cashiers would queue up. Cashier #1 would put in the first order. Cashiers 2-10 would wait their turn. After 5 minutes, cashier #1 would receive the meal, deliver it to customer #1, and then serve the next customer (#11) that just arrived at the 5-minute mark. Cashier #1 would take customer #11's order, then queue up and wait in line for the cook - there will be 9 other cashiers already in line, so the wait will be long. At the 10-minute mark, cashier #2 would receive a meal from the cook, deliver it to customer #2, then go on and serve the next customer (#12) that just arrived. Cashier #2 would then go and wait in line for the cook. This continues on through all the cashiers in order 1-10, then repeating, 1-10, ad infinitum.

Now even though you have 10 cashiers, most of their time is spent waiting to put in an order to the cook.
Starting with customer #11, all customers will wait 50 minutes for their meal. When customer #11 comes in he/she will immediately get to place an order, but it will take the cashier 45 minutes to wait for the cook to become free, and another 5 minutes for the meal to be cooked. The same is true for customer #12, and all customers from then on.

Now, the question is, could you get the same throughput with fewer cashiers? Say you had 2 cashiers instead. The 10 customers are there waiting. The 2 cashiers take orders from customers #1 and #2. Cashier #1 then gives the order to the cook and waits. Cashier #2 waits in line for the cook behind cashier #1. At the 5-minute mark, the first me
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Let me just try to explain my reasoning. I'll define a couple of my base assumptions, in case you disagree with them.

- Slices of CPU time doled out by the kernel are very small - so small that processes can be considered concurrent, even though technically they are handled serially.

Don't agree. You're equating the model with the implementation. Unix processes model concurrency, but when it comes down to it, if you don't have more CPUs than processes, you can only simulate concurrency. Each process runs until it either blocks on a resource (timer, network, disk, pipe to another process, etc), or a higher priority process pre-empts it, or it's taken so much time that the kernel wants to give another process a chance to run.

- A set of requests can be considered "simultaneous" if they all arrive and start being handled in a period of time shorter than the time it takes to service a request.

That sounds OK.

Operating on these two assumptions, I say that 10 simultaneous requests will require 10 interpreters to service them. There's no way to handle them with fewer, unless you queue up some of the requests and make them wait.

Right. And that waiting takes place:

- In the mutex around the accept call in the httpd
- In the kernel's run queue when the process is ready to run, but is waiting for other processes ahead of it.

So, since there is only one CPU, then in both cases (mod_perl and SpeedyCGI), processes spend time waiting. But what happens in the case of SpeedyCGI is that while some of the httpd's are waiting, one of the earlier speedycgi perl interpreters has already finished its run through the perl code and has put itself back at the front of the speedycgi queue. And by the time that Nth httpd gets around to running, it can re-use that first perl interpreter instead of needing yet another process. This is why it's important that you don't assume that Unix is truly concurrent.
I also say that if you have a top limit of 10 interpreters on your machine because of memory constraints, and you're sending in 10 simultaneous requests constantly, all interpreters will be used all the time. In that case it makes no difference to the throughput whether you use MRU or LRU.

This is not true for SpeedyCGI, because of the reason I give above. 10 simultaneous requests will not necessarily require 10 interpreters. What you say would be true if you had 10 processors and could get true concurrency. But on single-cpu systems you usually don't need 10 unix processes to handle 10 requests concurrently, since they get serialized by the kernel anyways.

I think the CPU slices are smaller than that. I don't know much about process scheduling, so I could be wrong. I would agree with you if we were talking about requests that were coming in with more time between them. Speedycgi will definitely use fewer interpreters in that case.

This url: http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html says the default timeslice is 210ms (about 1/5th of a second) for Linux on a PC. There's also lots of good info there on Linux scheduling.

I found that setting MaxClients to 100 stopped the paging. At concurrency level 100, both mod_perl and mod_speedycgi showed similar rates with ab. Even at higher levels (300), they were comparable.

That's what I would expect if both systems have a similar limit of how many interpreters they can fit in RAM at once. Shared memory would help here, since it would allow more interpreters to run.

By the way, do you limit the number of SpeedyCGI processes as well? It seems like you'd have to, or they'd start swapping too when you throw too many requests in.

SpeedyCGI has an optional limit on the number of processes, but I didn't use it in my testing.

But, to show that the underlying problem is still there, I then changed the hello_world script and doubled the amount of un-shared memory.
And of course the problem then came back for mod_perl, although speedycgi continued to work fine. I think this shows that mod_perl is still using quite a bit more memory than speedycgi to provide the same service.

I'm guessing that what happened was you ran mod_perl into swap again. You need to adjust MaxClients when your process size changes significantly.

Right, but this also points out how difficult it is to get mod_perl tuning just right. My opinion is that the MRU design adapts more dynamically to the load. I believe that with speedycgi you don't have to lower the MaxClients setting, because it's able to handle a larger number of clients, at least in this test.

Maybe what you're seeing is an ability to handle a larger number of requests (as opposed to clients) because of the performance benefit I mentioned above.

I don't follow. When not all processes are in use, I
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Right, but this also points out how difficult it is to get mod_perl tuning just right. My opinion is that the MRU design adapts more dynamically to the load.

How would this compare to apache's process management when using the front/back end approach?

Same thing applies. The front/back end approach does not change the fundamentals. I'd agree that the size of one Speedy backend plus one httpd would be the same or even greater than the size of one mod_perl/httpd when no memory is shared. But because the speedycgi httpds are small (no perl in them) and the number of SpeedyCGI perl interpreters is small, the total memory required is significantly smaller for the same load.

Likewise, it would be helpful if you would always make the comparison to the dual httpd setup that is often used for busy sites. I think it must really boil down to the efficiency of your IPC vs. access to the full apache environment.

The reason I don't include that comparison is that it's not fundamental to the differences between mod_perl and speedycgi, or LRU and MRU, that I have been trying to point out. Regardless of whether you add a frontend or not, the mod_perl process selection remains LRU and the speedycgi process selection remains MRU.
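The total-memory claim above can be made concrete with some back-of-envelope arithmetic. The figures below are purely hypothetical, chosen only to illustrate the shape of the argument; the thread gives no actual process sizes:

```python
def total_memory_mb(n_frontends, frontend_mb, n_backends, backend_mb):
    """Total resident memory for a frontend/backend split.
    All sizes here are invented for illustration, not measured."""
    return n_frontends * frontend_mb + n_backends * backend_mb

# mod_perl: every one of 30 httpds carries a full perl interpreter
# (say 8 MB unshared each).
mod_perl_total = total_memory_mb(0, 0, 30, 8)    # 240 MB

# speedycgi: 30 small httpds (no perl, say 1 MB each) plus a
# handful of backends, because MRU keeps the working set small.
speedy_total = total_memory_mb(30, 1, 5, 8)      # 70 MB
```

The exact numbers don't matter; the point is structural: with a perl interpreter in every httpd the interpreter cost scales with the httpd count, while with a small MRU-selected backend pool it scales with the (much smaller) number of interpreters actually needed.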
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
A few things:

- In your results, could you add the speedycgi version number (2.02), and the fact that this is using the mod_speedycgi frontend. The fork/exec frontend will be much slower on hello-world, so I don't want people to get the wrong idea. You may want to benchmark the fork/exec version as well.

- You may be able to eke out a little more performance by setting MaxRuns to 0 (infinite). This is set for mod_speedycgi using the SpeedyMaxRuns directive, or on the command line using "-r0". This setting is similar to the MaxRequestsPerChild setting in apache.

- My tests show mod_perl/speedy much closer than yours do, even with MaxRuns at its default value of 500. Maybe you're running on a different OS than I am - I'm using Redhat 6.2. I'm also running one rev lower of mod_perl, in case that matters.

Hey Sam, nice module. I just installed your SpeedyCGI for a good 'ol HelloWorld benchmark; it was a snap, well done. I'd like to add to the numbers below that a fair benchmark would be between mod_proxy in front of a mod_perl server and mod_speedycgi, as it would be a similar memory-saving model (this is how we often scale mod_perl)... both models would end up forwarding back to a smaller set of persistent perl interpreters. However, I did not do such a benchmark, so SpeedyCGI loses out a bit for the extra layer it has to do :( This is based on the suite at http://www.chamas.com/bench/hello.tar.gz, but I have not included the speedy test in that yet. -- Josh

Test Name                      Test File   Hits/sec   Total Hits   Total Time   sec/Hits   Bytes/Hit
---------                      ---------   --------   ----------   ----------   --------   ---------
Apache::Registry v2.01 CGI.pm  hello.cgi   451.9      27128 hits   60.03 sec    0.002213   216 bytes
Speedy CGI                     hello.cgi   375.2      22518 hits   60.02 sec    0.002665   216 bytes

Apache Server Header Tokens: (Unix) Apache/1.3.14 OpenSSL/0.9.6 PHP/4.0.3pl1 mod_perl/1.24 mod_ssl/2.7.1
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Are the speedycgi+Apache processes smaller than the mod_perl processes? If not, the maximum number of concurrent requests you can handle on a given box is going to be the same.

The size of the httpds running mod_speedycgi, plus the size of the speedycgi perl processes, is significantly smaller than the total size of the httpd's running mod_perl.

That would be true if you only ran one mod_perl'd httpd, but can you give a better comparison to the usual setup for a busy site, where you run a non-mod_perl lightweight front end and let mod_rewrite decide what is proxied through to the larger mod_perl'd backend, letting apache decide how many backends you need to have running?

The fundamental differences would remain the same - even in the mod_perl backend, the requests will be spread out over all the httpd's that are running, whereas speedycgi would tend to use fewer perl interpreters to handle the same load.

But with this setup, the mod_perl backend could probably be set to run fewer httpds, because it doesn't have to wait on slow clients. And the fewer httpd's you run with mod_perl, the smaller your total memory.

The reason for this is that only a handful of perl processes are required by speedycgi to handle the same load, whereas mod_perl uses a perl interpreter in all of the httpds.

I always see at least a 10-1 ratio of front-to-back end httpd's when serving over the internet. One effect that is difficult to benchmark is that clients connecting over the internet are often slow and will hold up the process that is delivering the data even though the processing has been completed. The proxy approach provides some buffering and allows the backend to move on more quickly. Does speedycgi do the same?

There are plans to make it so that SpeedyCGI does more buffering of the output in memory, perhaps eliminating the need for a caching frontend webserver. It works now only for the "speedy" binary (not mod_speedycgi), if you set the BufsizGet value high enough.
Of course you could add a caching webserver in front of the SpeedyCGI server just like you do with mod_perl now. So yes you can do the same with speedycgi now.
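The front/back-end setup discussed above is usually wired together with mod_rewrite's proxy flag. The thread doesn't give a config, so this is a minimal sketch; the port, path, and backend address are hypothetical:

```apache
# Front-end httpd.conf (lightweight, no mod_perl) - port and paths are
# made-up examples. Dynamic requests are proxied through to the heavier
# mod_perl backend; the front end absorbs slow clients and static files.
RewriteEngine On
RewriteRule ^/perl/(.*)$ http://127.0.0.1:8042/perl/$1 [P]
```

The back-end server then runs mod_perl on port 8042 with a small number of processes, since it never waits on a slow client directly.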
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Sorry for the late reply - I've been out for the holidays.

By the way, how are you doing it? Do you use a mutex routine that works in LIFO fashion?

Speedycgi uses separate backend processes that run the perl interpreters. The frontend processes (the httpd's that are running mod_speedycgi) communicate with the backends, sending over the request and getting the output. Speedycgi uses some shared memory (an mmap'ed file in /tmp) to keep track of the backends and frontends. This shared memory contains the queue. When backends become free, they add themselves at the front of this queue. When the frontends need a backend, they pull the first one from the front of this list.

I am saying that since SpeedyCGI uses MRU to allocate requests to perl interpreters, it winds up using a lot fewer interpreters to handle the same number of requests.

What I was saying is that it doesn't make sense for one to need fewer interpreters than the other to handle the same concurrency. If you have 10 requests at the same time, you need 10 interpreters. There's no way speedycgi can do it with fewer, unless it actually makes some of them wait. That could be happening, due to the fork-on-demand model, although your warmup round (priming the pump) should take care of that.

What you say would be true if you had 10 processors and could get true concurrency. But on single-cpu systems you usually don't need 10 unix processes to handle 10 requests concurrently, since they get serialized by the kernel anyway. I'll try to show how mod_perl handles 10 concurrent requests, and compare that to mod_speedycgi, so you can see the difference.

For mod_perl, let's assume we have 10 httpd's, h1 through h10, when the 10 concurrent requests come in. h1 has acquired the mutex, and h2-h10 are waiting (in order) on the mutex.
Here's how the cpu actually runs the processes:

    h1 accepts
    h1 releases the mutex, making h2 runnable
    h1 runs the perl code and produces the results
    h1 waits for the mutex
    h2 accepts
    h2 releases the mutex, making h3 runnable
    h2 runs the perl code and produces the results
    h2 waits for the mutex
    h3 accepts
    ...

This is pretty straightforward. Each of h1-h10 runs the perl code exactly once. They may not run exactly in this order, since a process could get pre-empted, or blocked waiting to send data to the client, etc. But regardless, each of the 10 processes will run the perl code exactly once.

Here's the mod_speedycgi example - it too uses httpd's h1-h10, and they all take turns running the mod_speedycgi frontend code. But the backends, where the perl code is, don't all have to be run fairly - they use MRU instead. I'll use b1 and b2 to represent 2 speedycgi backend processes, already queued up in that order. Here's a possible speedycgi scenario:

    h1 accepts
    h1 releases the mutex, making h2 runnable
    h1 sends a request to b1, making b1 runnable
    h2 accepts
    h2 releases the mutex, making h3 runnable
    h2 sends a request to b2, making b2 runnable
    b1 runs the perl code and sends the results to h1, making h1 runnable
    b1 adds itself to the front of the queue
    h3 accepts
    h3 releases the mutex, making h4 runnable
    h3 sends a request to b1, making b1 runnable
    b2 runs the perl code and sends the results to h2, making h2 runnable
    b2 adds itself to the front of the queue
    h1 produces the results it got from b1
    h1 waits for the mutex
    h4 accepts
    h4 releases the mutex, making h5 runnable
    h4 sends a request to b2, making b2 runnable
    b1 runs the perl code and sends the results to h3, making h3 runnable
    b1 adds itself to the front of the queue
    h2 produces the results it got from b2
    h2 waits for the mutex
    h5 accepts
    h5 releases the mutex, making h6 runnable
    h5 sends a request to b1, making b1 runnable
    b2 runs the perl code and sends the results to h4, making h4 runnable
    b2 adds itself to the front of the queue

This may be hard to follow, but hopefully you can see that the 10 httpd's just take turns using b1 and b2 over and over. So, the 10 concurrent requests end up being handled by just two perl backend processes. Again, this is simplified. If the perl processes get blocked, or pre-empted, you'll end up using more of them. But generally, the LIFO will cause SpeedyCGI to sort-of settle into the smallest number of processes needed for the task. The difference between the two approaches is that the mod_perl implementation forces unix to use 10 separate perl processes, while the mod_speedycgi implementation sort-of decides on the fly how many different processes are needed.

Please let me know what you think I should change. So far my benchmarks only show one trend, but if you can tell me specifically what I'm doing wrong (and it's something reasonable), I'll try it.

Try setting MinSpareServers
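The difference between the two traces comes down to where a worker re-enters the idle list. Here is a toy model of my own (not from the thread) of the idealized run-to-completion case on one CPU: LRU puts a finished worker at the back of the idle list (FIFO), MRU puts it at the front (LIFO, matching SpeedyCGI's "adds itself to the front of the queue"). Real processes that block or get timesliced will use more workers than this best case:

```python
# Toy scheduler model: count how many distinct workers each selection
# policy touches when requests are serialized (run-to-completion).
from collections import deque

def simulate(policy, n_requests=10, n_workers=10):
    idle = deque(range(n_workers))   # all workers idle at the start
    used = set()
    for _ in range(n_requests):
        w = idle.popleft()           # both policies take from the front
        used.add(w)
        # run-to-completion: the worker is idle again before the next pick
        if policy == "MRU":
            idle.appendleft(w)       # rejoin at the front (LIFO queue)
        else:                        # "LRU"
            idle.append(w)           # rejoin at the back (FIFO queue)
    return len(used)

print(simulate("LRU"))   # 10 - every worker handles exactly one request
print(simulate("MRU"))   # 1  - the same worker is reused for all of them
```

With any blocking mixed in, MRU degrades gracefully toward more workers, but the count stays small; LRU touches every worker regardless.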
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
This is planned for a future release of speedycgi, though there will probably be an option to set a maximum number of bytes that can be buffered before the frontend contacts a perl interpreter and starts passing over the bytes. Currently you can do this sort of acceleration with script output if you use the "speedy" binary (not mod_speedycgi), and you set the BufsizGet option high enough that it's able to buffer all the output from your script. The perl interpreter will then be able to detach and go handle other requests while the frontend process waits for the output to drain.

Perrin Harkins wrote:

What I was saying is that it doesn't make sense for one to need fewer interpreters than the other to handle the same concurrency. If you have 10 requests at the same time, you need 10 interpreters. There's no way speedycgi can do it with fewer, unless it actually makes some of them wait. That could be happening, due to the fork-on-demand model, although your warmup round (priming the pump) should take care of that.

I don't know if Speedy fixes this, but one problem with mod_perl v1 is that if, for instance, a large POST request is being uploaded, this ties up a whole perl interpreter while the transaction is occurring. This is at least one place where a Perl interpreter should not be needed. Of course, this could be overcome if an HTTP accelerator is used that takes the whole request before passing it to a local httpd, but I don't know of any proxies that work this way (AFAIK they all pass the packets as they arrive).
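The BufsizGet behavior described above boils down to one decision: if the whole response fits in the frontend's buffer, the backend can detach immediately and serve someone else while the slow client drains the buffer. A minimal sketch of that decision (BufsizGet is a real SpeedyCGI option; the code and the 8192-byte limit here are my own illustration):

```python
# Sketch: does the backend interpreter have to stay attached to the
# client, or can it hand the whole response to the frontend's buffer
# and detach? (Buffer size is an assumed example value.)
BUFSIZ_GET = 8192   # hypothetical buffer limit, in bytes

def backend_stays_attached(script_output: bytes) -> bool:
    """True if the output overflows the buffer, tying up the backend."""
    return len(script_output) > BUFSIZ_GET

print(backend_stays_attached(b"x" * 100))    # False - backend detaches
print(backend_stays_attached(b"x" * 20000))  # True - backend must wait
```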
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Gunther Birznieks wrote: Sam just posted this to the speedycgi list just now. [...] The underlying problem in mod_perl is that apache likes to spread out web requests to as many httpd's, and therefore as many mod_perl interpreters, as possible, using an LRU selection process for picking httpd's.

Hmmm... this doesn't sound right. I've never looked at the code in Apache that does this selection, but I was under the impression that the choice of which process would handle each request was an OS-dependent thing, based on some sort of mutex. Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html Doesn't that appear to be saying that whichever process gets into the mutex first will get the new request?

I would agree that whichever process gets into the mutex first will get the new request. That's exactly the problem I'm describing. What you are describing here is first-in, first-out behaviour, which implies LRU behaviour. Processes 1, 2, 3 are running. 1 finishes and requests the mutex, then 2 finishes and requests the mutex, then 3 finishes and requests the mutex. So when the next three requests come in, they are handled in the same order: 1, then 2, then 3 - this is FIFO or LRU. This is bad for performance.

In my experience running development servers on Linux, it always seemed as if the requests would continue going to the same process until a request came in when that process was already busy.

No, they don't. They go round-robin (or LRU, as I say it). Try this simple test script:

    use CGI;
    my $cgi = CGI->new;
    print $cgi->header();
    print "mypid=$$\n";

With mod_perl you constantly get different pids. With mod_speedycgi you usually get the same pid. This is a really good way to see the LRU/MRU difference that I'm talking about.

Here's the problem - the mutex in apache is implemented using a lock on a file. It's left up to the kernel to decide which process to give that lock to.
Now, if you're writing a unix kernel and implementing this file-locking code, what implementation would you use? Well, this is a general-purpose thing - you have 100 or so processes all trying to acquire this file lock. You could give out the lock randomly or in some ordered fashion. If I were writing the kernel, I would give it out in a round-robin fashion (or to the least-recently-used process, as I referred to it before). Why? Because otherwise one of those processes may starve waiting for this lock - it may never get the lock unless you do it in a fair (round-robin) manner.

The kernel doesn't know that all these httpd's are exactly the same. The kernel is implementing a general-purpose file-locking scheme, and it doesn't know whether one process is more important than another. If it's not fair about giving out the lock, a very important process might starve.

Take a look at fs/locks.c (I'm looking at linux 2.3.46). In there is the comment:

    /* Insert waiter into blocker's block list.
     * We use a circular list so that processes can be easily woken up in
     * the order they blocked. The documentation doesn't require this but
     * it seems like the reasonable thing to do.
     */
    static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter)

As I understand it, the implementation of "wake-one" scheduling in the 2.4 Linux kernel may affect this as well. It may then be possible to skip the mutex and use unserialized accept for single-socket servers, which will definitely hand process selection over to the kernel. If the kernel implemented the queueing for multiple accepts using a LIFO instead of a FIFO, and apache used this method instead of file locks, then that would probably solve it.
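The consequence of that circular block list is easy to see in miniature. This sketch is mine, not kernel code: a FIFO wait list hands the lock to whichever httpd has been waiting longest, so requests rotate through every process - exactly the round-robin/LRU assignment being described:

```python
# FIFO wait list for the accept mutex, modeled on the "woken up in the
# order they blocked" comment from fs/locks.c. Process names are made up.
from collections import deque

waiters = deque(["h1", "h2", "h3"])   # the order they blocked on the lock

def wake_one():
    """Wake the process at the head of the block list (fair/FIFO)."""
    return waiters.popleft()

def block_again(httpd):
    """After serving its request, the httpd blocks on the lock again."""
    waiters.append(httpd)

order = []
for _ in range(6):                    # six incoming requests
    h = wake_one()
    order.append(h)
    block_again(h)
print(order)   # ['h1', 'h2', 'h3', 'h1', 'h2', 'h3'] - round-robin (LRU)
```

Swap `waiters.append` for `waiters.appendleft` and the same loop serves every request with h1 - which is the LIFO behavior the post suggests would solve it.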
Just found this on the net on this subject:
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0455.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0453.html

The problem is that at a high concurrency level, mod_perl is using lots and lots of different perl interpreters to handle the requests, each with its own un-shared memory. It's doing this due to its LRU design. But with SpeedyCGI's MRU design, only a few speedy_backends are being used, because as much as possible it tries to use the same interpreter over and over, rather than spreading the requests out over lots of different interpreters. Mod_perl requires that lots of interpreters be in memory in order to handle the requests, whereas speedy only requires a small number of interpreters to be in memory.

This test - building up unshared memory in each process - is somewhat suspect, since in most setups I've seen there is a very significant amount of memory being shared between mod_perl processes.

My message and testing concern un-shared memory only. If all of your memory is shared, then there shouldn't be a
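The memory argument above is simple arithmetic: shared pages are paid for once, but un-shared memory is paid per resident interpreter, so the policy that keeps fewer interpreters resident wins. The numbers here are made up purely for illustration:

```python
# Back-of-the-envelope memory model for the un-shared-memory claim.
# Both figures are invented example values, not measurements.
SHARED_MB = 10    # shared text/data, paid once across all interpreters
UNSHARED_MB = 5   # un-shared (copy-on-write'd, per-process) data

def total_mb(n_interpreters):
    """Total resident memory for n interpreters under this model."""
    return SHARED_MB + n_interpreters * UNSHARED_MB

print(total_mb(10))  # 60 - LRU-style: 10 interpreters stay resident
print(total_mb(2))   # 20 - MRU-style: the same load on 2 interpreters
```

If UNSHARED_MB were near zero (everything shared), the two policies would cost about the same - which is the caveat the reply is making.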
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
Folks, your discussion is not short of wrong statements that can be easily proved, but I don't find it useful. I don't follow. Are you saying that my conclusions are wrong, but you don't want to bother explaining why? Would you agree with the following statement? Under apache-1, speedycgi scales better than mod_perl with scripts that contain un-shared memory
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
I've put your suggestion on the todo list. It certainly wouldn't hurt to have that feature, though I think memory sharing becomes a much smaller issue once you switch to MRU scheduling. At the moment I think SpeedyCGI has more pressing needs though - for example, multiple scripts in a single interpreter, and an NT port.

I think you could actually make speedycgi even better for shared memory usage by creating a special directive which would indicate to speedycgi to preload a series of modules. Then tell speedycgi to fork that "master" preloaded backend process and hand control over to the forked process whenever you need to launch a new process. Then speedy would potentially have the best of both worlds.

Sorry I cross-posted your thing. But I do think it is a problem for mod_perl also, and I am happily using speedycgi in production on at least one commercial site where mod_perl could not be installed so easily because of infrastructure issues.

I believe your mechanism of round-robining among MRU perl interpreters is actually also accomplished by ActiveState's PerlEx (based on Apache::Registry but using multithreaded IIS and a pool of interpreters). A similar method will be used in Apache 2.0, when Apache is multithreaded and can therefore control within program logic which Perl interpreter gets called from a pool of Perl interpreters. It just isn't so feasible right now in Apache 1.x to do this. And sometimes people forget that mod_perl came about primarily for writing handlers in Perl, not as an application environment, although it is very good for the latter as well.

I think SpeedyCGI needs more advocacy from the mod_perl group, because put simply, speedycgi is way easier to set up and use than mod_perl, and will likely get more PHP people using Perl again. If more people rely on Perl for their fast websites, then you will get more people looking for more power, and by extension more people using mod_perl. Whoops...
here we go with the advocacy thing again.

Later, Gunther

At 02:50 AM 12/21/2000 -0800, Sam Horrocks wrote: [...]
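The preload-then-fork idea suggested above (which mod_perl also uses for module preloading) can be sketched in miniature. This is a hypothetical illustration, not SpeedyCGI code: a master process loads modules once, then forks; the children inherit the already-loaded code as copy-on-write pages instead of each paying the compile cost:

```python
# Sketch of a "master" backend that preloads a module, then forks
# workers that inherit it. json stands in for a preloaded Perl module.
import json
import os

def spawn_backend():
    """Fork a worker that reuses the master's preloaded module."""
    pid = os.fork()
    if pid == 0:
        # child: json is already imported - no per-process startup cost
        ok = json.dumps({"ok": 1}) == '{"ok": 1}'
        os._exit(0 if ok else 1)   # _exit: skip parent cleanup handlers
    return pid

pid = spawn_backend()
_, status = os.waitpid(pid, 0)
print(os.WEXITSTATUS(status))      # 0 - worker ran using the preload
```

Until a child writes to those pages, they stay shared with the master, which is why preloading before forking keeps per-process un-shared memory down.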
Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory
I really wasn't trying to work backwards from a benchmark. It was more of an analysis of the design, and the benchmarks bore it out. It's sort of like coming up with a theory in science - if you can't get any experimental data to back up the theory, you're in big trouble. But if you can at least point out the existence of some experiments that are consistent with your theory, it means your theory may be true. The best would be to have other people do the same tests and see if they see the same trend. If no one else sees this trend, then I'd really have to re-think my analysis.

Another way to look at it - as you say below, MRU is going to be in mod_perl-2.0. And what is the reason for that? If there's no performance difference between LRU and MRU, why would the author bother to switch to MRU? So, I'm saying there must be some benchmarks somewhere that point out this difference - if there weren't any real-world difference, why bother even implementing MRU? I claim that my benchmarks point out this difference between MRU and LRU, and that's why my benchmarks show better performance on speedycgi than on mod_perl.

Sam - SpeedyCGI uses MRU, mod_perl-2 will eventually use MRU.

On Thu, 21 Dec 2000, Sam Horrocks wrote: Folks, your discussion is not short of wrong statements that can be easily proved, but I don't find it useful. I don't follow. Are you saying that my conclusions are wrong, but you don't want to bother explaining why? Would you agree with the following statement? Under apache-1, speedycgi scales better than mod_perl with scripts that contain un-shared memory

I don't know. It's easy to give a simple example and claim to be better. So far, whoever tried to show by benchmarks that he is better was most often proved wrong, since the technologies in question have so many features that I believe no benchmark will prove any of them absolutely superior or inferior.
Therefore I said that trying to claim your grass is greener is doomed to fail if someone has time on his hands to prove you wrong. Well, we don't have this time. Therefore I'm not trying to prove you wrong or right.

Gunther's point in the original forward was to show things that mod_perl may need to adopt to make it better. Doug already explained in his paper that the MRU approach has already been implemented in mod_perl-2.0. You can read it in the link that I attached and the quote that I quoted. So your conclusions about MRU are correct, and we have it implemented already (well, very soon now :).

I apologize if my original reply was misleading. I'm not saying that benchmarks are bad. What I'm saying is that it's very hard to benchmark things which are different. You benefit the most from benchmarking when you take the initial code/product, benchmark it, then try to improve the code and benchmark again to see whether it gave you any improvement. That's the area where benchmarks rule and where they are fair, because you test the same thing. You can read more of my rambling about benchmarks in the guide.

So if you find some cool features in other technologies that mod_perl might adopt and benefit from, don't hesitate to tell the rest of the gang.

Something that I'd like to comment on: I find it a bad practice to quote one sentence from a person's post and follow up on it. Someone from the list sent me this email (SB == me): "SB I don't find it useful" and followed up. Why not use a single letter: "SB I" and follow up? It's so much easier to flame on things taken out of their context. It has happened more than once that people did this to each other here on the list; I think I did too. So please be more careful when taking things out of context.

Thanks a lot, folks! Cheers...
_____________________________________________________________________
Stas Bekman              JAm_pH -- Just Another mod_perl Hacker
http://stason.org/       mod_perl Guide  http://perl.apache.org/guide
mailto:[EMAIL PROTECTED] http://apachetoday.com http://logilune.com/
http://singlesheaven.com http://perl.apache.org http://perlmonth.com/