At 13:26 -0800 1/9/04, Stas Bekman wrote:
Elizabeth Mattijsen wrote:
I'm sure you know my PerlMonks article "Things you need to know
before programming Perl ithreads" (
http://www.perlmonks.org/index.pl?node_id=288022 ).
So yes, in general I think you can say that the data copied for
each thread quickly dwarfs whatever optrees are shared.
How is this different from fork? When you fork, the OS shares all memory
pages between the parent and the child. As variables are modified,
memory pages become dirty and unshared. With forking, mutable data
(variables) and immutable data (opcodes) share the same memory pages,
so once a mutable variable changes, the opcodes allocated from the same
memory page get unshared too. So you get more and more memory unshared
as you go. In the long run (unless you use size-limiting tools) all the
memory gets unshared.
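For readers following along, that copy-on-write behaviour is easy to demonstrate in a small sketch: the child's write dirties its own copy of a page, while the parent's data stays untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# After fork, parent and child share this buffer copy-on-write;
# the child's write below unshares (dirties) the affected page,
# but the parent never sees the change.
my $big = 'x' x (1024 * 1024);    # ~1 MB of data

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                  # child
    substr($big, 0, 1, 'y');      # first write triggers the page copy
    exit 0;
}

waitpid($pid, 0);
print substr($big, 0, 1), "\n";   # parent still sees 'x'
```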

Well, yes. But you forget that when you load module A, usually modules B..Z get loaded as well, hidden from your direct view. And Perl has always taken the approach of using more memory rather than more CPU. So most modules are actually optimized by their authors to store intermediate results in maybe-not-so-intermediate variables. Not to mention that many modules build up internal data structures that may never be altered. Even compile-time constants need to have a CV in the stash where they exist, even though they're optimized away in the optree at compile time. And a CV qualifies as "data" as far as threads are concerned.


With ithreads, the opcode tree is always shared and the mutable data
is copied at the very beginning. So your memory consumption should be
exactly the same after the first request and after the 1000th request
(assuming that you don't allocate any memory at run time). Here you
get more memory consumed at the beginning of the spawned thread, but
it stays the same.
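That per-thread copy is easy to see in a small sketch (assuming a threads-enabled perl): a variable modified inside a thread leaves the parent's copy untouched, because the thread received its own copy at creation time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;                       # requires a threads-enabled perl

my $data = 'original';

# The new thread gets its own copy of $data at perl_clone time ...
my $t = threads->create(sub {
    $data = 'changed in thread';   # ... so this only touches the copy
    return $data;
});
my $in_thread = $t->join;

print "in thread: $in_thread\n";   # 'changed in thread'
print "in parent: $data\n";        # still 'original'
```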

Well, I see it this way: with threads, you take the hit for everything possible at the beginning. With fork, you take a hit whenever anything _actually_ changes, spread out over time. I would take fork() any time over that.


So let's say you have an 8MB opcode tree and 4MB of mutable data, the
process totalling 12MB. Using fork you will start off with all
12MB shared and get memory unshared as you go. With threads, you
will start off with 4MB of upfront memory consumption and it'll stay
the same.

But if you start 100 threads, you'll use 400 MByte, whereas if you fork 100 times, you'll start off with basically 12 MByte and a bit. It's the _memory_ usage that is causing the problem.

On top of that, I think you will find quite the opposite in the
amounts of optree and mutable data used.  A typical case would more
likely be something like 4MB of optree and 8MB of mutable data.

To prove my point, I have taken my Benchmark::Thread::Size module
(available from CPAN) and tested the behaviour of POSIX with and
without anything exported.

Performing each test 5 times
  #   (ref)         none         all 
                                  
  0    1724        +134 ± 6    +710 ± 6
  1    2080        +258 ± 6   +1468    
  2    2368        +334 ± 6   +2060    
  5    3232        +572       +3824    
 10    4656        +980       +6788    
 20    7512       +1796      +12704    
 50   16087 ± 6   +4228      +30448    
100   30380       +8284 ± 2  +60024    

==== none
==================================================
use POSIX ();

==== all
====================================================
use POSIX;

==========================================================

Sizes are displayed in Kbytes.  The average of 5 runs is shown, with
differences from the mean shown as ±N after that (if any difference
was found).  Each line shows memory used for the number of threads in
column 1.

The second column shows the reference memory usage: the "bare"
threads case.  You can see that 100 bare threads take about 30 MByte
of memory.  That's without _anything_ apart from "use threads".  No
other modules (at least not visible: you'd be amazed what actually
gets loaded when you do a "use threads").  You see that each thread
takes about 300 K extra (which more or less coincides with my
Devel::Size benchmark the other day).

The third column shows the _extra_ memory needed when a "use POSIX ()" is added.

The fourth column shows what happens when all constants and subs are
exported with "use POSIX".

Now I realize that many of the exported subs would otherwise have
been AUTOLOADed.  But still, then they would _only_ create a sub, an
optree.  And here you clearly see the effect that the exported subs
(which have a CV slot in the stash) have on the memory usage of a thread.

It is _very_ easy to use a _lot_ of memory this way.
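To get a feel for how many stash entries are involved, here is a quick sketch that counts what the import actually injects into a package (the exact number varies per perl version and platform; this is an illustration, not a benchmark):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();                    # load POSIX, but export nothing yet

my $before = keys %main::;       # stash entries before the import
POSIX->import;                   # what a plain "use POSIX;" would do
my $after  = keys %main::;

# Each exported sub/constant leaves a CV in the stash -- and that
# CV counts as per-thread "data" as far as ithreads are concerned.
printf "POSIX exported %d symbols into main::\n", $after - $before;
```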

Another example: one optimized away constant subroutine versus 2
optimized away constant subroutines:

Performing each test 5 times
  #   (ref)         one        two  
                                 
  0    1726 ± 6      -2          +0 ± 6
  1    2080          +4         +10 ± 6
  2    2368          +4         +10 ± 6
  5    3232          +4         +10 ± 6
 10    4656         +20         +24    
 20    7512         +40         +48    
 50   16087 ± 8    +100        +112    
100   30380 ± 2    +199        +223    

==== one
====================================================
sub foo () { 1 }

==== two
==================================================
sub foo () { 1 }
sub bar () { 1 }

==========================================================

You can see that 2 constant subroutines take up more thread memory
than 1.  And that's just because of the CV that stays behind in the
package stash.
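Both halves of that can be seen in one small sketch: the call site is folded to a constant at compile time, yet the CV is still reachable through the stash (and thus cloned into every thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub ANSWER () { 42 }              # empty prototype => inlinable constant

# Call sites like this compile down to the bare constant 42
# (the entersub is optimized away, as the Concise dumps show) ...
my $value = ANSWER;

# ... but the CV itself is still in the package stash, and that is
# what gets copied into every thread.
my $cv_present = defined &main::ANSWER;

print "value: $value, CV in stash: ", ($cv_present ? "yes" : "no"), "\n";
```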

In case you're not sure the subs get optimized away, look at these two optrees:

$ perl5.8.2-threaded -MO=Concise -e 'sub foo () { 1 }; foo'
3  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 2 -e:1) v ->3
-     <0> ex-const v ->3

$ perl5.8.2-threaded -MO=Concise -e 'sub foo { 1 }; foo'
6  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 2 -e:1) v ->3
5     <1> entersub[t2] vKS/TARG,1 ->6
-        <1> ex-list K ->5
3           <0> pushmark s ->4
-           <1> ex-rv2cv sK/129 ->-
4              <#> gv[*foo] s ->5


Now if in your fork setup you use a size-limiting tool to restart a
process when it reaches 8MB, you will get the same 4MB overhead per
process.  Besides equal memory usage you get better run-time
performance with threads, because there is no need to copy dirty pages
as with forks (everything was done at perl_clone time, which can be
arranged long before the request is served).

That may be so... but you will find that along with huge memory usage you will also burn a lot of CPU because of all the stash walking that happens during cloning.


(and you get a slowdown at the same time because of context management).
So, as you can see, it's quite possible that threads will perform
better than forks and consume an equal or smaller amount of memory if
the opcode tree is bigger than the mutable data.

Probably. But _only_ if you can get the desired number of threads into your physical RAM, _and_ you can live with possibly _long_ startup times (getting into tens of seconds, depending on the number of modules loaded and the hardware you're running on).



Liz

--
Reporting bugs: http://perl.apache.org/bugs/
Mail list info: http://perl.apache.org/maillist/modperl.html


