(second attempt: pasting UTF-8 into an iso8859-1 message or vice versa, or something, thoroughly messed up my reply. This time it should work out ok. Thanks, Stas, for pointing this out)

At 13:26 -0800 1/9/04, Stas Bekman wrote:
Elizabeth Mattijsen wrote:
> I'm sure you know my PerlMonks article "Things you need to know before programming Perl ithreads" ( http://www.perlmonks.org/index.pl?node_id=288022 ).
> So yes, in general I think you can say that the data copied for each thread quickly dwarfs whatever optrees are shared.
How is this different from fork? When you fork, the OS shares all memory pages between the parent and the child. As variables are modified, memory pages become dirty and unshared. With forking, mutable data (vars) and non-mutable data (opcodes) share the same memory pages, so once a mutable variable changes, the opcodes allocated from the same memory page get unshared too. So you get more and more memory unshared as you go. In the long run (unless you use size-limiting tools) all the memory gets unshared.

Well, yes. But you forget that when you load module A, usually modules B..Z are loaded as well, hidden from your direct view. And Perl has always taken the approach of using more memory rather than more CPU. So most modules are actually optimized by their authors to store intermediate results in maybe not-so-intermediate variables. Not to mention, many modules build up internal data structures that may never be altered. Even compile-time constants need to have a CV in the stash where they exist, even though they're optimized away in the optree at compile time. And a CV qualifies as "data" as far as threads are concerned.



With ithreads, the opcode tree is always shared and mutable data is copied at the very beginning. So your memory consumption should be exactly the same after the first request and after the 1000th request (assuming that you don't allocate any memory at run time). Here you get more memory consumed at the beginning of the spawned thread, but it stays the same.

Well, I see it this way: with threads, you're going to get the hit for everything possible at the beginning. With fork, you get hit whenever anything _actually_ changes, spread out over time. I would take fork() anytime over that.



So let's say you have an 8MB opcode tree and 4MB of mutable data, the process totalling 12MB. Using fork you will start off with all 12MB shared and get memory unshared as you go. With threads, you will start off with 4MB upfront memory consumption per thread and it'll stay the same.

But if you start 100 threads, you'll use 400 MByte, whereas if you fork 100 times, you'll start off with basically 12 MByte and a bit. It's the _memory_ usage that is causing the problem.


On top of that, I think you will find quite the opposite ratio of optree to mutable data usage. A typical case would more likely be something like 4MB of optree and 8MB of mutable data.

To prove my point, I have taken my Benchmark::Thread::Size module (available from CPAN) and tested the behaviour of POSIX with and without anything exported.

Performing each test 5 times
  #   (ref)        none         all
  0    1726        +129        +708
  1    2080        +256       +1468
  2    2368        +332       +2060
  5    3232        +572       +3824
 10    4656        +980       +6788
 20    7512       +1796      +12706
 50   16084       +4232      +30454
100   30380       +8284      +60023

==== none ========
use POSIX ();

==== all =========
use POSIX;

==================

Sizes are displayed in Kbytes. The average of 5 runs is shown. Each line shows memory used for the number of threads in column 1.

The second column shows the reference memory usage: the "bare" threads case. You can see that 100 bare threads take about 30 MByte of memory. That's without _anything_ apart from "use threads". No other modules (at least not visibly: you'd be amazed what actually gets loaded when you do a "use threads"). You see that each thread takes about 300K extra (which more or less coincides with my Devel::Size benchmark the other day).

The third column shows the _extra_ memory needed when a "use POSIX()" is added.

The fourth column shows what happens when all constants and subs are exported with "use POSIX".

Now I realize that many of the exported subs would otherwise have been AUTOLOADed. But then they would _only_ create a sub, an optree, when first called. And here you clearly see the effect that the exported subs (which each have a CV slot in the stash) have on the memory usage of a thread.

It is _very_ easy to use a _lot_ of memory this way.

Another example: one optimized away constant subroutine versus 2 optimized away constant subroutines:

Performing each test 5 times
(ref)  5 100
none  5 100
all  5 100
  #   (ref)        none         all
  0    1728          -2          -2
  1    2080          +4          +8
  2    2368          +4          +8
  5    3232          +4          +8
 10    4656         +20         +24
 20    7512         +40         +48
 50   16084        +104        +116
100   30380        +200        +224

==== none ========
sub foo () {1}

==== all =========
sub foo () {1}
sub bar () {1}

==================

You can see that 2 constant subroutines take up more thread memory than 1. And that's just because of the CV that stays behind in the package stash.

In case you're not sure the subs get optimized away, look at these two optrees:

$ perl5.8.2-threaded -MO=Concise -e 'sub foo () { 1 }; foo'
3  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 2 -e:1) v ->3
-     <0> ex-const v ->3

$ perl5.8.2-threaded -MO=Concise -e 'sub foo { 1 }; foo'
6  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 2 -e:1) v ->3
5     <1> entersub[t2] vKS/TARG,1 ->6
-        <1> ex-list K ->5
3           <0> pushmark s ->4
-           <1> ex-rv2cv sK/129 ->-
4              <#> gv[*foo] s ->5


Now if in your fork setup you cap processes at 8MB with a size-limiting tool that restarts them, you will get the same 4MB overhead per process. Besides equal memory usage, you get better run-time performance with threads, because they don't need to copy dirty pages as forks do (everything was done at perl_clone time, which can be arranged long before the request is served)

That may be so... but you will find that along with huge memory usage you will also burn a lot of CPU because of all the stash walking that happens during cloning.



(and you get a slowdown at the same time because of context management).
So, as you can see, it's quite possible that threads will perform better than forks and consume an equal or smaller amount of memory if the opcode tree is bigger than the mutable data.

Probably. But _only_ if you can get the desired number of threads into your physical RAM _and_ you can live with possibly _long_ startup times (getting into tens of seconds, depending on the number of modules loaded and the hardware you're running on).



Liz

--
Reporting bugs: http://perl.apache.org/bugs/
Mail list info: http://perl.apache.org/maillist/modperl.html


