Leopold Toetsch wrote:

Gordon Henriksen wrote:

I overstated when I said that morph must die. morph could live IF:

[ long proposal ]


Increasing the union size so that each pointer is distinct is not an option. This imposes considerable overhead on a non-threaded program too, due to its bigger PMC size.

That was the brute-force approach, separating out all of the pointers. If the scalar hierarchy doesn't use all 4 of the pointers, then the bloat can be reduced. So long as a morph can't make a memory location which was formerly a pointer variable point to a new type, or reuse that location for non-pointer data, pointers can remain part of a union without violating the principle of operation.
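
To make this concrete, here is a rough sketch of the layout I have in mind (this is not Parrot's actual PMC struct, and every field name below is made up for illustration):

struct vtable;                      /* whatever the class pointer is  */

struct pmc_sketch {
    struct vtable *vtable;          /* always a pointer, never reused */
    void          *data;            /* always a pointer or NULL       */
    union {                         /* non-pointer cache only         */
        long   int_val;
        double num_val;
    } cache;
};

The pointer slots stay pointer slots for the life of the header, so the collector and other threads can always read them as pointers; a morph is only allowed to reinterpret the non-pointer cache.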


And what of the per-PMC mutex? Is that not also considerable overhead? More than an unused field, even.

To keep internal state consistent we have to LOCK shared PMCs, that's it. This locking is sometimes necessary for reading too.

Sometimes? Unless parrot can prove a PMC is not shared, PMC locking is ALWAYS necessary for ALL accesses to ANY PMC. (This is easy to show: for any thread, hypothesize a preemption at an inconvenient PC and a morph by another thread before it resumes. Some morph will always put an int where a pointer was expected, or change the type behind a pointer.)
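
If it helps, here is a deliberately racy toy program (plain C with pthreads, nothing Parrot-specific, all names invented) that shows the failure mode. Run it long enough and the reader will eventually dereference 42 as a string pointer:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

enum tag { IS_STRING, IS_INT };

struct fake_pmc {
    enum tag type;
    union { char *str; long i; } u;     /* same storage, two meanings */
};

static struct fake_pmc pmc = { IS_STRING, { .str = "hello" } };

static void *morpher(void *arg)         /* another thread morphing    */
{
    (void)arg;
    for (;;) {
        pmc.type = IS_INT;    pmc.u.i   = 42;
        pmc.type = IS_STRING; pmc.u.str = "hello";
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, morpher, NULL);
    for (;;) {
        if (pmc.type == IS_STRING)              /* preempted here ... */
            printf("%zu\n", strlen(pmc.u.str)); /* ... and this slot  */
    }                                           /* now holds 42       */
}

No amount of care in the reader avoids this; only a lock around both the tag check and the dereference (or a guarantee that the PMC isn't shared) does.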



We seem to be painting ourselves into a corner here with impossible constraints. Clearly, adding size to PMCs is undesirable, and I recognize that you put a lot of work into shrinking the PMC struct relative to a Perl 5 scalar. But here are some numbers, fresh from CVS:


BENCHMARK                         USER     SYS  %CPU    TOTAL
------------------------------ ------- ------- ----- --------
addit.imc                       23.53s   0.05s   96%   24.517
addit2.imc                      15.97s   0.07s   98%   16.258
arriter.imc                      0.03s   0.03s   37%    0.159
arriter_o1.imc                   0.02s   0.04s   37%    0.160
fib.imc                          4.41s   3.07s   97%    7.691
addit.pasm                      17.84s   0.05s   95%   18.827
bench_newp.pasm                  4.53s   0.16s   97%    4.824
freeze.pasm                      1.47s   0.18s   82%    1.991
gc_alloc_new.pasm                0.20s   0.41s   72%    0.836
gc_alloc_reuse.pasm             19.09s  13.15s   98%   32.731
gc_generations.pasm             14.02s   5.42s   95%   20.424
gc_header_new.pasm               7.91s   1.37s   97%    9.558
gc_header_reuse.pasm            11.45s   0.09s   98%   11.733
gc_waves_headers.pasm            3.05s   0.40s   95%    3.597
gc_waves_sizeable_data.pasm      2.22s   3.81s   97%    6.200
gc_waves_sizeable_headers.pasm   7.48s   1.78s   97%    9.480
hash-utf8.pasm                   7.02s   0.07s   95%    7.404
primes.pasm                     61.63s   0.10s   97% 1:03.300
primes2.pasm                    39.26s   0.09s   97%   40.189
primes2_p.pasm                  69.00s   0.17s   96% 1:11.840
stress.pasm                      2.19s   0.28s   91%    2.705
stress1.pasm                    46.95s   1.33s   95%   50.377
stress2.pasm                     8.63s   0.24s   98%    9.026
stress3.pasm                    26.19s   0.78s   96%   28.063
TOTAL                                                7:21.890

And with a PMC struct that's bloated by 16 bytes as proposed:

BENCHMARK                         USER     SYS  %CPU    TOTAL
------------------------------ ------- ------- ----- --------
addit.imc                       23.67s   0.10s   96%   24.745
addit2.imc                      16.05s   0.07s   98%   16.414
arriter.imc                      0.03s   0.03s   44%    0.135
arriter_o1.imc                   0.03s   0.04s   48%    0.144
fib.imc                          4.47s   2.99s   86%    8.651
addit.pasm                      18.08s   0.08s   98%   18.442
bench_newp.pasm                  4.73s   0.18s   97%    5.019
freeze.pasm                      1.56s   0.29s   89%    2.075
gc_alloc_new.pasm                0.25s   0.35s   89%    0.673
gc_alloc_reuse.pasm             18.90s  13.58s   96%   33.642
gc_generations.pasm             13.84s   5.80s   98%   19.942
gc_header_new.pasm               7.98s   1.29s   97%    9.492
gc_header_reuse.pasm            11.37s   0.05s   98%   11.552
gc_waves_headers.pasm            3.09s   0.33s   96%    3.538
gc_waves_sizeable_data.pasm      1.98s   4.01s   96%    6.207
gc_waves_sizeable_headers.pasm   7.67s   1.60s   91%   10.112
hash-utf8.pasm                   7.03s   0.05s   97%    7.287
primes.pasm                     61.68s   0.15s   98% 1:02.800
primes2.pasm                    39.17s   0.12s   99%   39.553
primes2_p.pasm                  69.01s   0.18s   96% 1:11.640
stress.pasm                      2.32s   0.43s   96%    2.840
stress1.pasm                    50.85s   1.79s   93%   56.189
stress2.pasm                     9.18s   0.32s   98%    9.689
stress3.pasm                    29.06s   1.23s   98%   30.748
TOTAL                                                7:31.527

That's only a 2.1% overall slowdown.

What do you think the overall performance effect of fine-grained locking will be? You just showed in a microbenchmark that it's 400% for some operations, and we've heard anecdotal evidence of 400% *overall* performance hits from similar threading strategies in other projects. And remember, these overheads come ON TOP OF the user's own synchronization; the per-PMC locks will rarely coincide with the high-level locking the user actually needs.
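
To spell out what that looks like in code (hypothetical names, not the real vtable API): under fine-grained locking, even the most trivial fetch becomes lock / read / unlock, and the user's own coarser lock still has to sit around it, so every access in a guarded region pays twice.

#include <pthread.h>
#include <stdio.h>

typedef struct pmc {
    pthread_mutex_t lock;    /* the proposed per-PMC mutex            */
    long            value;   /* stand-in for the PMC's real contents  */
} PMC;

/* Even "fetch an integer out of a PMC" pays a lock round trip. */
static long pmc_get_integer(PMC *p)
{
    pthread_mutex_lock(&p->lock);
    long v = p->value;
    pthread_mutex_unlock(&p->lock);
    return v;
}

int main(void)
{
    static PMC a = { PTHREAD_MUTEX_INITIALIZER, 1 };
    static PMC b = { PTHREAD_MUTEX_INITIALIZER, 2 };
    static pthread_mutex_t user_lock = PTHREAD_MUTEX_INITIALIZER;

    /* The user still needs a coarser lock to keep a and b consistent
     * as a pair, so the per-PMC locks buy nothing at this level. */
    pthread_mutex_lock(&user_lock);
    long total = pmc_get_integer(&a) + pmc_get_integer(&b);
    pthread_mutex_unlock(&user_lock);

    printf("%ld\n", total);
    return 0;
}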

If these are the two options, then as a user I would rather have a separate threaded parrot executable that takes the 2.1% hit than one that takes the 400% overhead described above. It's easily the difference between usable threads and YAFATTP (yet another failed attempt to thread Perl).



Gordon Henriksen
[EMAIL PROTECTED]


