Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Sam Horrocks

There's only one run queue in the kernel.  The first task ready to run is
put at the head of that queue, and anything arriving afterwards waits.  Only
if that first task blocks on a resource, takes a very long time, or a
higher-priority process becomes able to run due to an interrupt is that
task taken out of the queue.
  
  Note that any I/O request that isn't completely handled by buffers will
  trigger the 'blocks on a resource' clause above, which means that
  jobs doing any real work will complete in an order determined by
  something other than the cpu and not strictly serialized.  Also, most
  of my web servers are dual-cpu so even cpu bound processes may
  complete out of order.

 I think it's much easier to visualize how MRU helps when you look at one
 thing running at a time.  And MRU works best when every process runs
 to completion instead of blocking, etc.  But even if the process gets
 timesliced, blocked, etc, MRU still degrades gracefully.  You'll get
 more processes in use, but still the numbers will remain small.

 Similarly, because of the non-deterministic nature of computer systems,
 Apache doesn't service requests on an LRU basis; you're comparing SpeedyCGI
 against a straw man.  Apache's servicing algorithm approaches randomness, so
 you need to build a comparison between forced-MRU and random choice.
  
Apache httpd's are scheduled on an LRU basis.  This was discussed early
in this thread.  Apache uses a file-lock for its mutex around the accept
call, and file-locking is implemented in the kernel using a round-robin
(fair) selection in order to prevent starvation.  This results in
incoming requests being assigned to httpd's in an LRU fashion.
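
A toy model (my own illustration, not Apache source) of the claim above: a fair, first-come-first-served mutex around accept() hands each incoming request to the httpd that has been waiting longest, which is exactly least-recently-used process selection.

```python
from collections import deque

# Idle httpds wait on a FIFO lock queue; the longest-waiting one wins
# the accept mutex, serves the request, then rejoins at the back.
idle = deque(["httpd1", "httpd2", "httpd3"])  # order of entering the wait

served = []
for request in range(6):
    worker = idle.popleft()   # least-recently-used httpd gets the request
    served.append(worker)
    idle.append(worker)       # after serving, it waits behind the others

print(served)  # strict rotation: every httpd gets used in turn
```

Under this model no httpd ever stays idle while another is reused, which is the LRU behavior being described.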
  
  But, if you are running a front/back end apache with a small number
  of spare servers configured on the back end there really won't be
  any idle perl processes during the busy times you care about.  That
  is, the  backends will all be running or apache will shut them down
  and there won't be any difference between MRU and LRU (the
  difference would be which idle process waits longer - if none are
  idle there is no difference).

 If you can tune it just right so you never run out of ram, then I think
 you could get the same performance as MRU on something like hello-world.

Once the httpd's get into the kernel's run queue, they finish in the
same order they were put there, unless they block on a resource, get
timesliced or are pre-empted by a higher priority process.
  
  Which means they don't finish in the same order if (a) you have
  more than one cpu, (b) they do any I/O (including delivering the
  output back, which they all do), or (c) some of them run long enough
  to consume a timeslice.
  
Try it and see.  I'm sure you'll run more processes with speedycgi, but
you'll probably run a whole lot fewer perl interpreters and need less ram.
  
  Do you have a benchmark that does some real work (at least a dbm
  lookup) to compare against a front/back end mod_perl setup?

 No, but if you send me one, I'll run it.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-19 Thread Sam Horrocks

  You know, I had brief look through some of the SpeedyCGI code yesterday,
  and I think the MRU process selection might be a bit of a red herring. 
  I think the real reason Speedy won the memory test is the way it spawns
  processes.

 Please take a look at that code again.  There's no smoke and mirrors,
 no red-herrings.  Also, I don't look at the benchmarks as "winning" - I
 am not trying to start a mod_perl vs speedy battle here.  Gunther wanted
 to know if there were "real benchmarks", so I reluctantly put them up.

 Here's how SpeedyCGI works (this is from version 2.02 of the code):

When the frontend starts, it tries to quickly grab a backend from
the front of the be_wait queue, which is a LIFO.  This is in
speedy_frontend.c, get_a_backend() function.

If there aren't any idle be's, it puts itself onto the fe_wait queue.
Same file, get_a_backend_hard().

If this fe (frontend) is at the front of the fe_wait queue, it
"takes charge" and starts looking to see if a backend needs to be
spawned.  This is part of the "frontend_ping()" function.  It will
only spawn a be if no other backends are being spawned, so only
one backend gets spawned at a time.

Every frontend in the queue drops into a sigsuspend and waits for an
alarm signal.  The alarm is set for 1 second.  This is also in
get_a_backend_hard().

When a backend is ready to handle code, it goes and looks at the fe_wait
queue and if there are fe's there, it sends a SIGALRM to the one at
the front, and sets the sent_sig flag for that fe.  This is done in
speedy_group.c, speedy_group_sendsigs().

When a frontend wakes on an alarm (either due to a timeout, or due to
a be waking it up), it looks at its sent_sig flag to see if it can now
grab a be from the queue.  If so it does that.  If not, it runs various
checks then goes back to sleep.

 In most cases, you should get a be from the lifo right at the beginning
 in the get_a_backend() function.  Unless there aren't enough be's running,
 or something is killing them (bad perl code), or you've set the
 MaxBackends option to limit the number of be's.
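
A minimal sketch (not the real speedy_frontend.c, which is C) of the be_wait LIFO described above: because idle backends go on a stack, a frontend always grabs the most-recently-freed backend.

```python
# Toy model of SpeedyCGI's be_wait queue: a LIFO stack of idle backends.
be_wait = []  # idle backend pids, most recently freed on top

def backend_done(pid):
    be_wait.append(pid)       # freed backend goes onto the front of the LIFO

def get_a_backend():
    if be_wait:
        return be_wait.pop()  # grab the most-recently-used backend
    return None               # would fall through to get_a_backend_hard()

backend_done(101)
backend_done(102)
pid = get_a_backend()    # 102: the backend freed last is reused first
backend_done(pid)
pid2 = get_a_backend()   # 102 again: the same process, over and over
print(pid, pid2)
```

This is the MRU property: under light enough load, one hot backend keeps being reused while older ones sit idle and eventually time out.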


  If I understand what's going on in Apache's source, once every second it
  has a look at the scoreboard and says "less than MinSpareServers are
  idle, so I'll start more" or "more than MaxSpareServers are idle, so
  I'll kill one".  It only kills one per second.  It starts by spawning
  one, but the number spawned goes up exponentially each time it sees
  there are still not enough idle servers, until it hits 32 per second. 
  It's easy to see how this could result in spawning too many in response
  to sudden load, and then taking a long time to clear out the unnecessary
  ones.
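
The exponential spawning described above can be sketched as follows (a toy model of the description, not Apache's actual source; the doubling-to-32 behavior is as stated in the paragraph):

```python
# Apache 1.3's once-per-second idle-server check, as described above:
# each consecutive second the server is still short of MinSpareServers,
# the number spawned doubles, capped at 32 per second.
def spawn_schedule(seconds_short):
    batches, n = [], 1
    for _ in range(seconds_short):
        batches.append(n)
        n = min(n * 2, 32)
    return batches

print(spawn_schedule(7))  # [1, 2, 4, 8, 16, 32, 32]
```

Seven seconds of sustained shortfall spawns 95 servers in total, which shows how a load spike can leave a large surplus that is then reaped at only one process per second.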
  
  In contrast, Speedy checks on every request to see if there are enough
  backends running.  If there aren't, it spawns more until there are as
  many backends as queued requests.
 
 Speedy does not check on every request to see if there are enough
 backends running.  In most cases, the only thing the frontend does is
 grab an idle backend from the lifo.  Only if there are none available
 does it start to worry about how many are running, etc.

  That means it never overshoots the mark.

 You're correct that speedy does try not to overshoot, but mainly
 because there's no point in overshooting - it just wastes swap space.
 But that's not the heart of the mechanism.  There truly is a LIFO
 involved.  Please read that code again, or run some tests.  Speedy
 could overshoot by far, and the worst that would happen is that you
 would get a lot of idle backends sitting in virtual memory, which the
 kernel would page out, and then at some point they'll time out and die.
 Unless of course the load increases to a point where they're needed,
 in which case they would get used.

 If you have speedy installed, you can manually start backends yourself
 and test.  Just run "speedy_backend script.pl " to start a backend.
 If you start lots of those on a script that says 'print "$$\n"', then
 run the frontend on the same script, you will still see the same pid
 over and over.  This is the LIFO in action, reusing the same process
 over and over.

  Going back to your example up above, if Apache actually controlled the
  number of processes tightly enough to prevent building up idle servers,
  it wouldn't really matter much how processes were selected.  If after
  the 1st and 2nd interpreters finish their run they went to the end of
  the queue instead of the beginning of it, that simply means they will
  sit idle until called for instead of some other two processes sitting
  idle until called for.  If the systems were both efficient enough about
  spawning to only create as many interpreters as needed, none of them
  would be sitting idle and memory usage would always be as low as
  possible.
  
  I don't know if I'm explaining this very well, but the gist of my theory
  is that at any given time both 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks
, each customer
puts in their order immediately, then waits 50 minutes for it to arrive.
In the second scenario each customer waits 40 minutes to put in
their order, then waits another 10 minutes for it to arrive.

What I'm trying to show with this analogy is that no matter how many
"simultaneous" requests you have, they all have to be serialized at
some point because you only have one CPU.  Either you can serialize them
before they get to the perl interpreter, or afterward.  Either way you
wait on the CPU, and you get the same throughput.
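
A small numerical check of that point (my own toy model, assuming pure CPU-bound requests on one CPU): wherever the queueing happens, the CPU completes one request per service time, so throughput is the same.

```python
# One CPU, 10 simultaneous requests, each needing 5 time-units of CPU.
# Whether requests queue before reaching a perl interpreter (few
# workers, MRU) or in the kernel run queue (many workers, LRU), the
# CPU serializes them and finishes one every 5 units.
SERVICE = 5
REQUESTS = 10

completion_times = [SERVICE * (i + 1) for i in range(REQUESTS)]
throughput = REQUESTS / completion_times[-1]
print(completion_times[-1], throughput)  # last finishes at 50; 0.2 req/unit
```

The only thing that differs between the two scenarios is where each request spends its waiting time, not how many get done per second.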

Does that help?

  I have just gotten around to reading this thread I've been saving for a 
  rainy day. Well, it's not rainy, but I'm finally getting to it. Apologies 
  to those who hate when people don't snip their reply mails, but I am 
  including it so that the entire context is not lost.
  
  Sam (or others who may understand Sam's explanation),
  
  I am still confused by this explanation of MRU helping when there are 10 
  processes serving 10 requests at all times. I understand MRU helping when 
  the processes are not at max, but I don't see how it helps when they are at 
  max utilization.
  
  It seems to me that if the wait is the same for mod_perl backend processes 
  and speedyCGI processes, that it doesn't matter if some of the speedycgi 
  processes cycle earlier than the mod_perl ones because all 10 will always 
  be used.
  
  I did read and reread (once) the snippets about modeling concurrency and 
  the httpd waiting for an accept.  But I still don't understand how MRU helps 
  when all the processes would be in use anyway. At that point they all have 
  an equal chance of being called.
  
  Could you clarify this with a simpler example? Maybe 4 processes and a 
  sample timeline of what happens to those when there are enough requests to 
  keep all 4 busy all the time for speedyCGI and a mod_perl backend?
  
  At 04:32 AM 1/6/01 -0800, Sam Horrocks wrote:
 Let me just try to explain my reasoning.  I'll define a couple of my
 base assumptions, in case you disagree with them.

 - Slices of CPU time doled out by the kernel are very small - so small
 that processes can be considered concurrent, even though technically
 they are handled serially.
  
Don't agree.  You're equating the model with the implementation.
Unix processes model concurrency, but when it comes down to it, if you
don't have more CPU's than processes, you can only simulate concurrency.
  
Each process runs until it either blocks on a resource (timer, network,
disk, pipe to another process, etc), or a higher priority process
pre-empts it, or it's taken so much time that the kernel wants to give
another process a chance to run.
  
 - A set of requests can be considered "simultaneous" if they all arrive
 and start being handled in a period of time shorter than the time it
 takes to service a request.
  
That sounds OK.
  
 Operating on these two assumptions, I say that 10 simultaneous requests
 will require 10 interpreters to service them.  There's no way to handle
 them with fewer, unless you queue up some of the requests and make them
 wait.
  
Right.  And that waiting takes place:
  
   - In the mutex around the accept call in the httpd
  
   - In the kernel's run queue when the process is ready to run, but is
 waiting for other processes ahead of it.
  
So, since there is only one CPU, then in both cases (mod_perl and
SpeedyCGI), processes spend time waiting.  But what happens in the
case of SpeedyCGI is that while some of the httpd's are waiting,
one of the earlier speedycgi perl interpreters has already finished
its run through the perl code and has put itself back at the front of
the speedycgi queue.  And by the time that Nth httpd gets around to
running, it can re-use that first perl interpreter instead of needing
yet another process.
  
This is why it's important that you don't assume that Unix is truly
concurrent.
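
The point above can be made concrete with a toy simulation (mine, not SpeedyCGI code): if the single CPU fully serializes the runs, an MRU stack lets the interpreter freed by one request serve the next, so far fewer interpreters get spawned than there are "simultaneous" requests.

```python
# Toy model: 10 httpds hold requests, but the single CPU runs perl
# interpreters one at a time.  With an MRU (LIFO) free list, the
# interpreter freed by request N is reused for request N+1.
idle = []      # MRU stack of free interpreters
spawned = 0

def acquire():
    global spawned
    if idle:
        return idle.pop()     # reuse the most recently freed interpreter
    spawned += 1              # only spawn when none are free
    return f"perl{spawned}"

for request in range(10):     # fully serialized by the one CPU
    interp = acquire()
    idle.append(interp)       # finished: back on top of the stack

print(spawned)  # 1 -- a single interpreter handled all 10 requests
```

In reality blocking and timeslicing would push the number above one, but as argued earlier in the thread, MRU degrades gracefully: the count stays small rather than growing to match the number of waiting httpds.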
  
 I also say that if you have a top limit of 10 interpreters on your
 machine because of memory constraints, and you're sending in 10
 simultaneous requests constantly, all interpreters will be used all the
 time.  In that case it makes no difference to the throughput whether you
 use MRU or LRU.
  
This is not true for SpeedyCGI, because of the reason I give above.
10 simultaneous requests will not necessarily require 10 interpreters.
  
   What you say would be true if you had 10 processors and could get
   true concurrency.  But on single-cpu systems you usually don't need
   10 unix processes to handle 10 requests concurrently, since they get
   serialized by the kernel anyways.

 I think the CPU slices are smaller than that.  I don't know much about
 process scheduling, so I could be wrong.  I would agree with you if we
 were ta

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks
which/hello_world
ab -t 30 -c 300 http://localhost/$which/hello_world


Before running each test, I rebooted my system.  Here's the software
installed:

angel: {139}# rpm -q -a |egrep -i 'mod_perl|speedy|apache'
apache-1.3.9-4
speedycgi-2.02-1
apache-devel-1.3.9-4
speedycgi-apache-2.02-1
mod_perl-1.21-2

Here are some relevant parameters from my httpd.conf:

MinSpareServers 8
MaxSpareServers 20
StartServers 10
MaxClients 150
MaxRequestsPerChild 1
SpeedyMaxRuns 0









  At 03:19 AM 1/17/01 -0800, Sam Horrocks wrote:
  I think the major problem is that you're assuming that just because
  there are 10 constant concurrent requests, that there have to be 10
  perl processes serving those requests at all times in order to get
  maximum throughput.  The problem with that assumption is that there
  is only one CPU - ten processes cannot all run simultaneously anyways,
  so you don't really need ten perl interpreters.
  
  I've been trying to think of better ways to explain this.  I'll try to
  explain with an analogy - it's sort-of lame, but maybe it'll give you
  a mental picture of what's happening.  To eliminate some confusion,
  this analogy doesn't address LRU/MRU, nor waiting on other events like
  network or disk i/o.  It only tries to explain why you don't necessarily
  need 10 perl-interpreters to handle a stream of 10 concurrent requests
  on a single-CPU system.
  
  You own a fast-food restaurant.  The players involved are:
  
   Your customers.  These represent the http requests.
  
   Your cashiers.  These represent the perl interpreters.
  
   Your cook.  You only have one.  This represents your CPU.
  
  The normal flow of events is this:
  
   A cashier gets an order from a customer.  The cashier goes and
   waits until the cook is free, and then gives the order to the cook.
   The cook then cooks the meal, taking 5-minutes for each meal.
   The cashier waits for the meal to be ready, then takes the meal and
   gives it to the customer.  The cashier then serves another customer.
   The cashier/customer interaction takes a very small amount of time.
  
  The analogy is this:
  
   An http request (customer) arrives.  It is given to a perl
   interpreter (cashier).  A perl interpreter must wait for all other
   perl interpreters ahead of it to finish using the CPU (the cook).
   It can't serve any other requests until it finishes this one.
   When its turn arrives, the perl interpreter uses the CPU to process
   the perl code.  It then finishes and gives the results over to the
   http client (the customer).
  
  Now, say in this analogy you begin the day with 10 customers in the store.
  At each 5-minute interval thereafter another customer arrives.  So at time
  0, there is a pool of 10 customers.  At time +5, another customer arrives.
  At time +10, another customer arrives, ad infinitum.
  
  You could hire 10 cashiers in order to handle this load.  What would
  happen is that the 10 cashiers would fairly quickly get all the orders
  from the first 10 customers simultaneously, and then start waiting for
  the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
  first order.  Cashiers #2 through #10 would wait their turn.  After 5-minutes,
  cashier number 1 would receive the meal, deliver it to customer #1, and
  then serve the next customer (#11) that just arrived at the 5-minute mark.
  Cashier #1 would take customer #11's order, then queue up and wait in
  line for the cook - there will be 9 other cashiers already in line, so
  the wait will be long.  At the 10-minute mark, cashier #2 would receive
  a meal from the cook, deliver it to customer #2, then go on and serve
  the next customer (#12) that just arrived.  Cashier #2 would then go and
  wait in line for the cook.  This continues on through all the cashiers
  in order 1-10, then repeating, 1-10, ad infinitum.
  
  Now even though you have 10 cashiers, most of their time is spent
  waiting to put in an order to the cook.  Starting with customer #11,
  all customers will wait 50-minutes for their meal.  When customer #11
  comes in he/she will immediately get to place an order, but it will take
  the cashier 45-minutes to wait for the cook to become free, and another
  5-minutes for the meal to be cooked.  Same is true for customer #12,
  and all customers from then on.
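
The arithmetic in the analogy checks out; here is the steady-state wait worked through (numbers taken directly from the scenario above):

```python
# With 10 cashiers, one cook, 5-minute meals, and a new customer every
# 5 minutes, customer #11's order goes in immediately but waits behind
# the 9 orders already queued at the cook.
COOK_TIME = 5          # minutes per meal
queued_ahead = 9       # cashiers already in line for the cook

wait_for_cook = queued_ahead * COOK_TIME   # 45 minutes in line
total_wait = wait_for_cook + COOK_TIME     # plus 5 minutes cooking = 50
print(wait_for_cook, total_wait)
```

So every customer from #11 onward sees the same 50-minute total, exactly as stated.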
  
  Now, the question is, could you get the same throughput with fewer
  cashiers?  Say you had 2 cashiers instead.  The 10 customers are
  there waiting.  The 2 cashiers take orders from customers #1 and #2.
  Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
  in line for the cook behind cashier #1.  At the 5-minute mark, the first
  meal is done.  Cashier #1 delivers the meal to customer #1, then serves
  customer #3.  Cashier #1 then goes and stands in line behind cashier #2.
  At the 10-minute mark, cashier #2's meal is ready - it's delivered to
  customer 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-17 Thread Sam Horrocks

There is no coffee.  Only meals.  No substitutions. :-)

If we added coffee to the menu it would still have to be prepared by the cook.
Remember that you only have one CPU, and all the perl interpreters, large and
small, must gain access to that CPU in order to run.

Sam


  I have a wide assortment of queries on a site, some of which take several 
  minutes to execute, while others execute in less than one second. If I 
  understand this analogy correctly, I'd be better off with the current 
  incarnation of mod_perl because there would be more cashiers around to serve 
  the "quick cups of coffee" that many customers request at my diner.
  
  Is this correct?
  
  
  Sam Horrocks wrote:
   
   I think the major problem is that you're assuming that just because
   there are 10 constant concurrent requests, that there have to be 10
   perl processes serving those requests at all times in order to get
   maximum throughput.  The problem with that assumption is that there
   is only one CPU - ten processes cannot all run simultaneously anyways,
   so you don't really need ten perl interpreters.
   
   I've been trying to think of better ways to explain this.  I'll try to
   explain with an analogy - it's sort-of lame, but maybe it'll give you
   a mental picture of what's happening.  To eliminate some confusion,
   this analogy doesn't address LRU/MRU, nor waiting on other events like
   network or disk i/o.  It only tries to explain why you don't necessarily
   need 10 perl-interpreters to handle a stream of 10 concurrent requests
   on a single-CPU system.
   
   You own a fast-food restaurant.  The players involved are:
   
   Your customers.  These represent the http requests.
   
   Your cashiers.  These represent the perl interpreters.
   
    Your cook.  You only have one.  This represents your CPU.
   
   The normal flow of events is this:
   
   A cashier gets an order from a customer.  The cashier goes and
   waits until the cook is free, and then gives the order to the cook.
   The cook then cooks the meal, taking 5-minutes for each meal.
   The cashier waits for the meal to be ready, then takes the meal and
   gives it to the customer.  The cashier then serves another customer.
   The cashier/customer interaction takes a very small amount of time.
   
   The analogy is this:
   
   An http request (customer) arrives.  It is given to a perl
   interpreter (cashier).  A perl interpreter must wait for all other
   perl interpreters ahead of it to finish using the CPU (the cook).
   It can't serve any other requests until it finishes this one.
   When its turn arrives, the perl interpreter uses the CPU to process
   the perl code.  It then finishes and gives the results over to the
   http client (the customer).
   
   Now, say in this analogy you begin the day with 10 customers in the store.
   At each 5-minute interval thereafter another customer arrives.  So at time
   0, there is a pool of 10 customers.  At time +5, another customer arrives.
   At time +10, another customer arrives, ad infinitum.
   
   You could hire 10 cashiers in order to handle this load.  What would
   happen is that the 10 cashiers would fairly quickly get all the orders
   from the first 10 customers simultaneously, and then start waiting for
   the cook.  The 10 cashiers would queue up.  Cashier #1 would put in the
   first order.  Cashiers #2 through #10 would wait their turn.  After 5-minutes,
   cashier number 1 would receive the meal, deliver it to customer #1, and
   then serve the next customer (#11) that just arrived at the 5-minute mark.
   Cashier #1 would take customer #11's order, then queue up and wait in
   line for the cook - there will be 9 other cashiers already in line, so
   the wait will be long.  At the 10-minute mark, cashier #2 would receive
   a meal from the cook, deliver it to customer #2, then go on and serve
   the next customer (#12) that just arrived.  Cashier #2 would then go and
   wait in line for the cook.  This continues on through all the cashiers
   in order 1-10, then repeating, 1-10, ad infinitum.
   
   Now even though you have 10 cashiers, most of their time is spent
   waiting to put in an order to the cook.  Starting with customer #11,
   all customers will wait 50-minutes for their meal.  When customer #11
   comes in he/she will immediately get to place an order, but it will take
   the cashier 45-minutes to wait for the cook to become free, and another
   5-minutes for the meal to be cooked.  Same is true for customer #12,
   and all customers from then on.
   
   Now, the question is, could you get the same throughput with fewer
   cashiers?  Say you had 2 cashiers instead.  The 10 customers are
   there waiting.  The 2 cashiers take orders from customers #1 and #2.
   Cashier #1 then gives the order to the cook and waits.  Cashier #2 waits
   in line for the cook behind cashier #1.  At the 5-minute mark, the first
   me

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

  Let me just try to explain my reasoning.  I'll define a couple of my
  base assumptions, in case you disagree with them.
  
  - Slices of CPU time doled out by the kernel are very small - so small
  that processes can be considered concurrent, even though technically
  they are handled serially.

 Don't agree.  You're equating the model with the implementation.
 Unix processes model concurrency, but when it comes down to it, if you
 don't have more CPU's than processes, you can only simulate concurrency.

 Each process runs until it either blocks on a resource (timer, network,
 disk, pipe to another process, etc), or a higher priority process
 pre-empts it, or it's taken so much time that the kernel wants to give
 another process a chance to run.

  - A set of requests can be considered "simultaneous" if they all arrive
  and start being handled in a period of time shorter than the time it
  takes to service a request.

 That sounds OK.

  Operating on these two assumptions, I say that 10 simultaneous requests
  will require 10 interpreters to service them.  There's no way to handle
  them with fewer, unless you queue up some of the requests and make them
  wait.

 Right.  And that waiting takes place:

- In the mutex around the accept call in the httpd

- In the kernel's run queue when the process is ready to run, but is
  waiting for other processes ahead of it.

 So, since there is only one CPU, then in both cases (mod_perl and
 SpeedyCGI), processes spend time waiting.  But what happens in the
 case of SpeedyCGI is that while some of the httpd's are waiting,
 one of the earlier speedycgi perl interpreters has already finished
 its run through the perl code and has put itself back at the front of
 the speedycgi queue.  And by the time that Nth httpd gets around to
 running, it can re-use that first perl interpreter instead of needing
 yet another process.

 This is why it's important that you don't assume that Unix is truly
 concurrent.

  I also say that if you have a top limit of 10 interpreters on your
  machine because of memory constraints, and you're sending in 10
  simultaneous requests constantly, all interpreters will be used all the
  time.  In that case it makes no difference to the throughput whether you
  use MRU or LRU.

 This is not true for SpeedyCGI, because of the reason I give above.
 10 simultaneous requests will not necessarily require 10 interpreters.

What you say would be true if you had 10 processors and could get
true concurrency.  But on single-cpu systems you usually don't need
10 unix processes to handle 10 requests concurrently, since they get
serialized by the kernel anyways.
  
  I think the CPU slices are smaller than that.  I don't know much about
  process scheduling, so I could be wrong.  I would agree with you if we
  were talking about requests that were coming in with more time between
  them.  Speedycgi will definitely use fewer interpreters in that case.

 This url:

http://www.oreilly.com/catalog/linuxkernel/chapter/ch10.html

 says the default timeslice is 210ms (1/5th of a second) for Linux on a PC.
 There's also lots of good info there on Linux scheduling.

I found that setting MaxClients to 100 stopped the paging.  At concurrency
level 100, both mod_perl and mod_speedycgi showed similar rates with ab.
Even at higher levels (300), they were comparable.
  
  That's what I would expect if both systems have a similar limit of how
  many interpreters they can fit in RAM at once.  Shared memory would help
  here, since it would allow more interpreters to run.
  
  By the way, do you limit the number of SpeedyCGI processes as well?  it
  seems like you'd have to, or they'd start swapping too when you throw
  too many requests in.

 SpeedyCGI has an optional limit on the number of processes, but I didn't
 use it in my testing.

But, to show that the underlying problem is still there, I then changed
the hello_world script and doubled the amount of un-shared memory.
And of course the problem then came back for mod_perl, although speedycgi
continued to work fine.  I think this shows that mod_perl is still
using quite a bit more memory than speedycgi to provide the same service.
  
  I'm guessing that what happened was you ran mod_perl into swap again. 
  You need to adjust MaxClients when your process size changes
  significantly.

 Right, but this also points out how difficult it is to get mod_perl
 tuning just right.  My opinion is that the MRU design adapts more
 dynamically to the load.

   I believe that with speedycgi you don't have to lower the MaxClients
   setting, because it's able to handle a larger number of clients, at
   least in this test.

 Maybe what you're seeing is an ability to handle a larger number of
 requests (as opposed to clients) because of the performance benefit I
 mentioned above.
   
I don't follow.
  
  When not all processes are in use, I 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

Right, but this also points out how difficult it is to get mod_perl
tuning just right.  My opinion is that the MRU design adapts more
dynamically to the load.
  
  How would this compare to apache's process management when
  using the front/back end approach?

 Same thing applies.  The front/back end approach does not change the
 fundamentals.

I'd agree that the size of one Speedy backend + one httpd would be the
same or even greater than the size of one mod_perl/httpd when no memory
is shared.  But because the speedycgi httpds are small (no perl in them)
and the number of SpeedyCGI perl interpreters is small, the total memory
required is significantly smaller for the same load.
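
To make that memory claim concrete, here is a back-of-the-envelope comparison with hypothetical sizes (the numbers below are my own illustrative assumptions, not measurements from this thread):

```python
# Assumed process sizes in MB (illustrative only): a perl-free httpd is
# small; a mod_perl httpd and a speedy perl backend each carry a full
# interpreter.
HTTPD_SMALL  = 1   # httpd with mod_speedycgi, no perl inside
HTTPD_MODPERL = 8  # httpd with an embedded perl interpreter
PERL_BACKEND = 8   # standalone speedycgi perl backend

httpds   = 30      # assumed httpds needed to handle the client load
backends = 5       # assumed perl backends needed (MRU keeps this small)

speedy_total  = httpds * HTTPD_SMALL + backends * PERL_BACKEND  # 70 MB
modperl_total = httpds * HTTPD_MODPERL                          # 240 MB
print(speedy_total, modperl_total)
```

The exact figures don't matter; the point is structural: mod_perl pays for an interpreter in every httpd, while speedycgi pays for one in only the handful of backends that MRU actually keeps busy.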
  
  Likewise, it would be helpful if you would always make the comparison
  to the dual httpd setup that is often used for busy sites.   I think it must
  really boil down to the efficiency of your IPC vs. access to the full
  apache environment.

 The reason I don't include that comparison is that it's not fundamental
 to the differences between mod_perl and speedycgi or LRU and MRU that
 I have been trying to point out.  Regardless of whether you add a
 frontend or not, the mod_perl process selection remains LRU and the
 speedycgi process selection remains MRU.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-06 Thread Sam Horrocks

A few things:

- In your results, could you add the speedycgi version number (2.02),
  and the fact that this is using the mod_speedycgi frontend.
  The fork/exec frontend will be much slower on hello-world so I don't
  want people to get the wrong idea.  You may want to benchmark
  the fork/exec version as well.

- You may be able to eke out a little more performance by setting
  MaxRuns to 0 (infinite).  This is set for mod_speedycgi using the
  SpeedyMaxRuns directive, or on the command-line using "-r0".
  This setting is similar to the MaxRequestsPerChild setting in apache.

- My tests show mod_perl/speedy much closer than yours do, even with
  MaxRuns at its default value of 500.  Maybe you're running on
  a different OS than I am - I'm using Redhat 6.2.  I'm also running
  one rev lower of mod_perl in case that matters.


  Hey Sam, nice module.  I just installed your SpeedyCGI for a good 'ol
  HelloWorld benchmark, and it was a snap, well done.  I'd like to add to the 
  numbers below that a fair benchmark would be between mod_proxy in front 
  of a mod_perl server and mod_speedycgi, as it would be a similar memory 
  saving model ( this is how we often scale mod_perl )... both models would
  end up forwarding back to a smaller set of persistent perl interpreters.
  
  However, I did not do such a benchmark, so SpeedyCGI loses out a
  bit for the extra layer it has to do :(   This is based on the 
  suite at http://www.chamas.com/bench/hello.tar.gz, but I have not
  included the speedy test in that yet.
  
   -- Josh
  
  Test Name                      Test File  Hits/sec  Total Hits  Total Time  sec/Hits  Bytes/Hit
  -----------------------------  ---------  --------  ----------  ----------  --------  ---------
  Apache::Registry v2.01 CGI.pm  hello.cgi  451.9     27128 hits  60.03 sec   0.002213  216 bytes
  Speedy CGI                     hello.cgi  375.2     22518 hits  60.02 sec   0.002665  216 bytes
  
  Apache Server Header Tokens
  ---
  (Unix)
  Apache/1.3.14
  OpenSSL/0.9.6
  PHP/4.0.3pl1
  mod_perl/1.24
  mod_ssl/2.7.1



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-05 Thread Sam Horrocks

 Are the speedycgi+Apache processes smaller than the mod_perl
 processes?  If not, the maximum number of concurrent requests you can
 handle on a given box is going to be the same.
  
The size of the httpds running mod_speedycgi, plus the size of speedycgi
perl processes is significantly smaller than the total size of the httpd's
running mod_perl.
  
  That would be true if you only ran one mod_perl'd httpd, but can you
  give a better comparison to the usual setup for a busy site where
  you run a non-mod_perl lightweight front end and let mod_rewrite
  decide what is proxied through to the larger mod_perl'd backend,
  letting apache decide how many backends you need to have
  running?

 The fundamental differences would remain the same - even in the mod_perl
 backend, the requests will be spread out over all the httpd's that are
 running, whereas speedycgi would tend to use fewer perl interpreters
 to handle the same load.

 But with this setup, the mod_perl backend could probably be set to run
 fewer httpds because it doesn't have to wait on slow clients.  And the
 fewer httpd's you run with mod_perl the smaller your total memory.

The reason for this is that only a handful of perl processes are required by
speedycgi to handle the same load, whereas mod_perl uses a perl interpreter
in all of the httpds.
  
  I always see at least a 10-1 ratio of front-to-back end httpd's when serving
  over the internet.   One effect that is difficult to benchmark is that clients
  connecting over the internet are often slow and will hold up the process
  that is delivering the data even though the processing has been completed.
  The proxy approach provides some buffering and allows the backend
  to move on more quickly.  Does speedycgi do the same?

 There are plans to make it so that SpeedyCGI does more buffering of
 the output in memory, perhaps eliminating the need for caching frontend
 webserver.  It works now only for the "speedy" binary (not mod_speedycgi)
 if you set the BufsizGet value high enough.

 Of course you could add a caching webserver in front of the SpeedyCGI server
 just like you do with mod_perl now.  So yes you can do the same with
 speedycgi now.



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-04 Thread Sam Horrocks

Sorry for the late reply - I've been out for the holidays.

  By the way, how are you doing it?  Do you use a mutex routine that works
  in LIFO fashion?

 Speedycgi uses separate backend processes that run the perl interpreters.
 The frontend processes (the httpd's that are running mod_speedycgi)
 communicate with the backends, sending over the request and getting the output.

 Speedycgi uses some shared memory (an mmap'ed file in /tmp) to keep track
 of the backends and frontends.  This shared memory contains the queue.
 When backends become free, they add themselves at the front of this queue.
 When the frontends need a backend they pull the first one from the front
 of this list.
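The free-list behaviour described above can be sketched with a toy model. This is a simulation of my own for illustration only; the real implementation is a queue in an mmap'ed file in /tmp, not a Python deque:

```python
from collections import deque

# Toy model of the SpeedyCGI shared-memory queue: backends that become
# free add themselves at the *front*, and frontends also pull from the
# front -- so the most-recently-freed backend is always reused first.
free_backends = deque()

def backend_done(pid):
    """A backend finished a request and adds itself at the front."""
    free_backends.appendleft(pid)

def frontend_get():
    """A frontend takes the first backend from the front of the list."""
    return free_backends.popleft()

backend_done("b1")
backend_done("b2")              # b2 freed up last, so it sits at the front
assert frontend_get() == "b2"   # MRU: the most recently freed backend wins
```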

  
I am saying that since SpeedyCGI uses MRU to allocate requests to perl
interpreters, it winds up using a lot fewer interpreters to handle the
same number of requests.
  
  What I was saying is that it doesn't make sense for one to need fewer
  interpreters than the other to handle the same concurrency.  If you have
  10 requests at the same time, you need 10 interpreters.  There's no way
  speedycgi can do it with fewer, unless it actually makes some of them
  wait.  That could be happening, due to the fork-on-demand model, although
  your warmup round (priming the pump) should take care of that.

 What you say would be true if you had 10 processors and could get
 true concurrency.  But on single-cpu systems you usually don't need
 10 unix processes to handle 10 requests concurrently, since they get
 serialized by the kernel anyways.  I'll try to show how mod_perl handles
 10 concurrent requests, and compare that to mod_speedycgi so you can
 see the difference.

 For mod_perl, let's assume we have 10 httpd's, h1 through h10,
 when the 10 concurrent requests come in.  h1 has acquired the mutex,
 and h2-h10 are waiting (in order) on the mutex.  Here's how the cpu
 actually runs the processes:

h1 accepts
h1 releases the mutex, making h2 runnable
h1 runs the perl code and produces the results
h1 waits for the mutex

h2 accepts
h2 releases the mutex, making h3 runnable
h2 runs the perl code and produces the results
h2 waits for the mutex

h3 accepts
...

 This is pretty straightforward.  Each of h1-h10 run the perl code
 exactly once.  They may not run exactly in this order since a process
 could get pre-empted, or blocked waiting to send data to the client,
 etc.  But regardless, each of the 10 processes will run the perl code
 exactly once.

 Here's the mod_speedycgi example - it too uses httpd's h1-h10, and they
 all take turns running the mod_speedycgi frontend code.  But the backends,
 where the perl code is, don't have to all be run fairly - they use MRU
 instead.  I'll use b1 and b2 to represent 2 speedycgi backend processes,
 already queued up in that order.

 Here's a possible speedycgi scenario:

h1 accepts
h1 releases the mutex, making h2 runnable
h1 sends a request to b1, making b1 runnable

h2 accepts
h2 releases the mutex, making h3 runnable
h2 sends a request to b2, making b2 runnable

b1 runs the perl code and sends the results to h1, making h1 runnable
b1 adds itself to the front of the queue

h3 accepts
h3 releases the mutex, making h4 runnable
h3 sends a request to b1, making b1 runnable

b2 runs the perl code and sends the results to h2, making h2 runnable
b2 adds itself to the front of the queue

h1 produces the results it got from b1
h1 waits for the mutex

h4 accepts
h4 releases the mutex, making h5 runnable
h4 sends a request to b2, making b2 runnable

b1 runs the perl code and sends the results to h3, making h3 runnable
b1 adds itself to the front of the queue

h2 produces the results it got from b2
h2 waits for the mutex

h5 accepts
h5 releases the mutex, making h6 runnable
h5 sends a request to b1, making b1 runnable

b2 runs the perl code and sends the results to h4, making h4 runnable
b2 adds itself to the front of the queue

 This may be hard to follow, but hopefully you can see that the 10 httpd's
 just take turns using b1 and b2 over and over.  So, the 10 concurrent
 requests end up being handled by just two perl backend processes.  Again,
 this is simplified.  If the perl processes get blocked, or pre-empted,
 you'll end up using more of them.  But generally, the LIFO will cause
 SpeedyCGI to sort-of settle into the smallest number of processes needed for
 the task.

 The difference between the two approaches is that the mod_perl
 implementation forces unix to use 10 separate perl processes, while the
 mod_speedycgi implementation sort-of decides on the fly how many
 different processes are needed.
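The contrast between the two walkthroughs can be condensed into a small simulation. This is my own sketch, not code from either package: serve requests strictly one at a time (the single-CPU serialization described above), pick a worker from one end of the free list or the other, and count how many distinct workers ever run:

```python
from collections import deque

def serve(n_requests, n_workers, policy):
    """Serve requests one at a time from a free list of workers.
    'lru' pops the least-recently-freed worker (FIFO, like Apache's
    fair accept mutex); 'mru' pops the most-recently-freed worker
    (LIFO, like SpeedyCGI's queue).  Returns the set of workers used."""
    free = deque(range(n_workers))
    used = set()
    for _ in range(n_requests):
        w = free.popleft() if policy == "lru" else free.pop()
        used.add(w)
        # On one CPU the worker finishes before the next pick happens,
        # so it goes straight back onto the tail of the free list.
        free.append(w)
    return used

# 100 serialized requests with 10 workers available:
# LRU rotates through all 10; MRU settles onto a single worker.
assert len(serve(100, 10, "lru")) == 10
assert len(serve(100, 10, "mru")) == 1
```

If a worker blocks mid-request the picture gets messier, as noted above, but the MRU count only grows to the number of requests actually overlapping at once, not to the full pool size.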

Please let me know what you think I should change.  So far my
benchmarks only show one trend, but if you can tell me specifically
what I'm doing wrong (and it's something reasonable), I'll try it.
  
  Try setting MinSpareServers 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2001-01-04 Thread Sam Horrocks

This is planned for a future release of speedycgi, though there will
probably be an option to set a maximum number of bytes that can be
bufferred before the frontend contacts a perl interpreter and starts
passing over the bytes.

Currently you can do this sort of acceleration with script output if you
use the "speedy" binary (not mod_speedycgi), and you set the BufsizGet option
high enough so that it's able to buffer all the output from your script.
The perl interpreter will then be able to detach and go handle other
requests while the frontend process waits for the output to drain.

  Perrin Harkins wrote:
   What I was saying is that it doesn't make sense for one to need fewer
   interpreters than the other to handle the same concurrency.  If you have
   10 requests at the same time, you need 10 interpreters.  There's no way
   speedycgi can do it with fewer, unless it actually makes some of them
   wait.  That could be happening, due to the fork-on-demand model, although
   your warmup round (priming the pump) should take care of that.
  
  I don't know if Speedy fixes this, but one problem with mod_perl v1 is that
  if, for instance, a large POST request is being uploaded, this takes a whole
  perl interpreter while the transaction is occurring. This is at least one
  place where a Perl interpreter should not be needed.
  
  Of course, this could be overcome if an HTTP Accelerator is used that takes
  the whole request before passing it to a local httpd, but I don't know of
  any proxies that work this way (AFAIK they all pass the packets as they
  arrive).



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

  Gunther Birznieks wrote:
   Sam just posted this to the speedycgi list just now.
  [...]
   The underlying problem in mod_perl is that apache likes to spread out
   web requests to as many httpd's, and therefore as many mod_perl interpreters,
   as possible using an LRU selection processes for picking httpd's.
  
  Hmmm... this doesn't sound right.  I've never looked at the code in
  Apache that does this selection, but I was under the impression that the
  choice of which process would handle each request was an OS dependent
  thing, based on some sort of mutex.
  
  Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html
  
  Doesn't that appear to be saying that whichever process gets into the
  mutex first will get the new request?

 I would agree that whichever process gets into the mutex first will get
 the new request.  That's exactly the problem I'm describing.  What you
 are describing here is first-in, first-out behaviour which implies LRU
 behaviour.

 Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
 2 finishes and requests the mutex, then 3 finishes and requests the mutex.
 So when the next three requests come in, they are handled in the same order:
 1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.

  In my experience running
  development servers on Linux it always seemed as if the requests
  would continue going to the same process until a request came in when
  that process was already busy.

 No, they don't.  They go round-robin (or LRU as I say it).

 Try this simple test script:

 use CGI;
 my $cgi = CGI->new;
 print $cgi->header();
 print "mypid=$$\n";

 With mod_perl you constantly get different pids.  With mod_speedycgi you
 usually get the same pid.  This is a really good way to see the LRU/MRU
 difference that I'm talking about.

 Here's the problem - the mutex in apache is implemented using a lock
 on a file.  It's left up to the kernel to decide which process to give
 that lock to.

 Now, if you're writing a unix kernel and implementing this file locking code,
 what implementation would you use?  Well, this is a general purpose thing -
 you have 100 or so processes all trying to acquire this file lock.  You could
 give out the lock randomly or in some ordered fashion.  If I were writing
 the kernel I would give it out in a round-robin fashion (or the
 least-recently-used process as I referred to it before).  Why?  Because
 otherwise one of those processes may starve waiting for this lock - it may
 never get the lock unless you do it in a fair (round-robin) manner.

 The kernel doesn't know that all these httpd's are exactly the same.
 The kernel is implementing a general-purpose file-locking scheme and
 it doesn't know whether one process is more important than another.  If
 it's not fair about giving out the lock a very important process might
 starve.
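The starvation argument can be made concrete with the same kind of toy model. This is an illustration of the general scheduling point, not kernel code: hand the lock to either the earliest or the latest waiter, let the winner immediately re-queue (as busy httpd's do), and see who makes progress:

```python
from collections import deque

def grant_rounds(policy, rounds=100):
    """Waiters w0..w2 block on a lock.  Each time the lock is released
    it is granted to one waiter, who immediately re-queues.  'fifo'
    grants to the earliest waiter (fair, round-robin); 'lifo' grants
    to the most recent waiter.  Returns each waiter's win count."""
    waiters = deque(["w0", "w1", "w2"])
    wins = {"w0": 0, "w1": 0, "w2": 0}
    for _ in range(rounds):
        w = waiters.popleft() if policy == "fifo" else waiters.pop()
        wins[w] += 1
        waiters.append(w)   # the winner goes back to the end of the queue
    return wins

fair = grant_rounds("fifo")     # round-robin: everyone makes progress
unfair = grant_rounds("lifo")   # the latest waiter monopolizes the lock
assert min(fair.values()) > 0
assert unfair["w0"] == 0        # the earliest waiter starves
```

This is exactly why a general-purpose kernel lock is fair, and exactly why fairness produces LRU assignment of httpd's.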

 Take a look at fs/locks.c (I'm looking at linux 2.3.46).  In there is the
 comment:

 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
  * it seems like the reasonable thing to do.
  */
 static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter)

  As I understand it, the implementation of "wake-one" scheduling in the
  2.4 Linux kernel may affect this as well.  It may then be possible to
  skip the mutex and use unserialized accept for single socket servers,
  which will definitely hand process selection over to the kernel.

 If the kernel implemented the queueing for multiple accepts using a LIFO
 instead of a FIFO and apache used this method instead of file locks,
 then that would probably solve it.

 Just found this on the net on this subject:
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0455.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/9704.0/0453.html

   The problem is that at a high concurrency level, mod_perl is using lots
   and lots of different perl-interpreters to handle the requests, each
   with its own un-shared memory.  It's doing this due to its LRU design.
   But with SpeedyCGI's MRU design, only a few speedy_backends are being used
   because as much as possible it tries to use the same interpreter over and
   over and not spread out the requests to lots of different interpreters.
   Mod_perl is using lots of perl-interpreters, while speedycgi is only using
   a few.  mod_perl is requiring that lots of interpreters be in memory in
   order to handle the requests, whereas speedy only requires a small number
   of interpreters to be in memory.
  
  This test - building up unshared memory in each process - is somewhat
  suspect since in most setups I've seen, there is a very significant
  amount of memory being shared between mod_perl processes.

 My message and testing concerns un-shared memory only.  If all of your memory
 is shared, then there shouldn't be a 

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

  Folks, your discussion is not short of wrong statements that can be easily
  proved, but I don't find it useful.

 I don't follow.  Are you saying that my conclusions are wrong, but
 you don't want to bother explaining why?
 
 Would you agree with the following statement?

Under apache-1, speedycgi scales better than mod_perl with
scripts that contain un-shared memory 



Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

I've put your suggestion on the todo list.  It certainly wouldn't hurt to
have that feature, though I think memory sharing becomes a much much smaller
issue once you switch to MRU scheduling.

At the moment I think SpeedyCGI has more pressing needs though - for
example multiple scripts in a single interpreter, and an NT port.


  I think you could actually make speedycgi even better for shared memory 
  usage by creating a special directive which would indicate to speedycgi to 
  preload a series of modules. And then to tell speedycgi to do forking of 
  that "master" backend preloaded module process and hand control over to 
  that forked process whenever you need to launch a new process.
  
  Then speedy would potentially have the best of both worlds.
  
  Sorry I cross posted your thing. But I do think it is a problem of mod_perl 
  also, and I am happily using speedycgi in production on at least one 
  commercial site where mod_perl could not be installed so easily because of 
  infrastructure issues.
  
  I believe your mechanism of round robining among MRU perl interpreters is 
  actually also accomplished by ActiveState's PerlEx (based on 
  Apache::Registry but using multithreaded IIS and pool of Interpreters). A 
  method similar to this will be used in Apache 2.0 when Apache is 
  multithreaded and therefore can control within program logic which Perl 
  interpreter gets called from a pool of Perl interpreters.
  
  It just isn't so feasible right now in Apache 1.0 to do this. And sometimes 
  people forget that mod_perl came about primarily for writing handlers in 
  Perl not as an application environment although it is very good for the 
  latter as well.
  
  I think SpeedyCGI needs more advocacy from the mod_perl group because put 
  simply speedycgi is way easier to set up and use than mod_perl and will 
  likely get more PHP people using Perl again. If more people rely on Perl 
  for their fast websites, then you will get more people looking for more 
  power, and by extension more people using mod_perl.
  
  Whoops... here we go with the advocacy thing again.
  
  Later,
  Gunther
  
  At 02:50 AM 12/21/2000 -0800, Sam Horrocks wrote:
 Gunther Birznieks wrote:
  Sam just posted this to the speedycgi list just now.
 [...]
  The underlying problem in mod_perl is that apache likes to spread out
  web requests to as many httpd's, and therefore as many mod_perl 
   interpreters,
  as possible using an LRU selection processes for picking httpd's.

 Hmmm... this doesn't sound right.  I've never looked at the code in
 Apache that does this selection, but I was under the impression that the
 choice of which process would handle each request was an OS dependent
 thing, based on some sort of mutex.

 Take a look at this: http://httpd.apache.org/docs/misc/perf-tuning.html

 Doesn't that appear to be saying that whichever process gets into the
 mutex first will get the new request?
  
I would agree that whichever process gets into the mutex first will get
the new request.  That's exactly the problem I'm describing.  What you
are describing here is first-in, first-out behaviour which implies LRU
behaviour.
  
Processes 1, 2, 3 are running.  1 finishes and requests the mutex, then
2 finishes and requests the mutex, then 3 finishes and requests the mutex.
So when the next three requests come in, they are handled in the same order:
1, then 2, then 3 - this is FIFO or LRU.  This is bad for performance.
  
 In my experience running
 development servers on Linux it always seemed as if the requests
 would continue going to the same process until a request came in when
 that process was already busy.
  
No, they don't.  They go round-robin (or LRU as I say it).
  
Try this simple test script:
  
use CGI;
my $cgi = CGI->new;
print $cgi->header();
print "mypid=$$\n";
  
With mod_perl you constantly get different pids.  With mod_speedycgi you
usually get the same pid.  This is a really good way to see the LRU/MRU
difference that I'm talking about.
  
Here's the problem - the mutex in apache is implemented using a lock
on a file.  It's left up to the kernel to decide which process to give
that lock to.
  
Now, if you're writing a unix kernel and implementing this file locking 
   code,
what implementation would you use?  Well, this is a general purpose thing -
you have 100 or so processes all trying to acquire this file lock.  You 
   could
give out the lock randomly or in some ordered fashion.  If I were writing
the kernel I would give it out in a round-robin fashion (or the
least-recently-used process as I referred to it before).  Why?  Because
otherwise one of those processes may starve waiting for this lock - it may
never get the lock unless you do it in a fair (round-robin) manner.
  
The

Re: Fwd: [speedycgi] Speedycgi scales better than mod_perl with scripts that contain un-shared memory

2000-12-21 Thread Sam Horrocks

I really wasn't trying to work backwards from a benchmark.  It was
more of an analysis of the design, and the benchmarks bore it out.
It's sort of like coming up with a theory in science - if you can't get
any experimental data to back up the theory, you're in big trouble.
But if you can at least point out the existence of some experiments
that are consistent with your theory, it means your theory may be true.

The best would be to have other people do the same tests and see if they
see the same trend.  If no-one else sees this trend, then I'd really
have to re-think my analysis.

Another way to look at it - as you say below MRU is going to be in
mod_perl-2.0.  And what is the reason for that?  If there's no performance
difference between LRU and MRU why would the author bother to switch
to MRU.  So, I'm saying there must be some benchmarks somewhere that
point out this difference - if there weren't any real-world difference,
why bother even implementing MRU.

I claim that my benchmarks point out this difference between MRU over
LRU, and that's why my benchmarks show better performance on speedycgi
than on mod_perl.

Sam

- SpeedyCGI uses MRU, mod_perl-2 will eventually use MRU.  
  On Thu, 21 Dec 2000, Sam Horrocks wrote:
  
 Folks, your discussion is not short of wrong statements that can be easily
 proved, but I don't find it useful.
   
I don't follow.  Are you saying that my conclusions are wrong, but
you don't want to bother explaining why?

Would you agree with the following statement?
   
   Under apache-1, speedycgi scales better than mod_perl with
   scripts that contain un-shared memory 
  
  I don't know. It's easy to give a simple example and claim being better.
  So far, whoever has tried to show by benchmarks that he is better has most
  often been proved wrong, since the technologies in question have so many
  features, that I believe no benchmark will prove any of them absolutely
  superior or inferior. Therefore I said that trying to tell that your grass
  is greener is doomed to fail if someone has time on his hands to prove you
  wrong. Well, we don't have this time.
  
  Therefore I'm not trying to prove you wrong or right. Gunther's point of
  the original forward was to show things that mod_perl may need to adopt to
  make it better. Doug already explained in his paper that the MRU approach
  has been already implemented in mod_perl-2.0. You could read it in the
  link that I've attached and the quote that I've quoted.
  
  So your conclusions about MRU are correct and we have it implemented
  already (well very soon now :). I apologize if my original reply was
  misleading.
  
  I'm not telling that benchmarks are bad. What I'm telling is that it's
  very hard to benchmark things which are different. You benefit the most
  from the benchmarking when you take the initial code/product, benchmark
  it, then you try to improve the code and benchmark again to see whether it
  gave you any improvement. That's the area where the benchmarks rule and
  they are fair, because you test the same thing. Well you could read more
  of my rambling about benchmarks in the guide.
  
  So if you find some cool features in other technologies that mod_perl
  might adopt and benefit from, don't hesitate to tell the rest of the gang.
  
  
  
  Something that I'd like to comment on:
  
  I find it a bad practice to quote one sentence from a person's post and
  follow up on it. Someone from the list has sent me this email (SB == me):
  
  SB I don't find it useful
  
  and follow up. Why not use a single letter:
  
  SB I
  
  and follow up? It's so much easier to flame on things taken out of their
  context.
  
  It has happened more than once that people did this to each other here on the list; I
  think I did too. So please be more careful when taking things out of
  context. Thanks a lot, folks!
  
  Cheers...
  
  _
  Stas Bekman  JAm_pH --   Just Another mod_perl Hacker
  http://stason.org/   mod_perl Guide  http://perl.apache.org/guide 
  mailto:[EMAIL PROTECTED]   http://apachetoday.com http://logilune.com/
  http://singlesheaven.com http://perl.apache.org http://perlmonth.com/