Gordon Henriksen wrote:
I overstated when I said that morph must die. morph could live IF:
[ long proposal ]
Increasing the union size so that each pointer is distinct is not an option. This imposes considerable overhead on a non-threaded program too, due to its bigger PMC size.
That was the brute-force approach, separating out all pointers. If the scalar hierarchy doesn't use all 4 of the pointers, then the bloat can be reduced. So long as a morph can't make a memory location which was formerly a pointer variable point to a new type, or reuse that memory location for data, then pointers can be part of a union without violating the principle of operation.
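To make the rule concrete, here is a minimal C sketch (all names invented, not Parrot's actual layout): the two union variants agree on which offset holds a pointer and on the pointed-to type, so a morph between them can never retype the pointer slot or overlay it with data.

    /* Hypothetical sketch: toy_buffer and safe_pmc_data are invented
     * names for illustration, not Parrot's real structures. */
    struct toy_buffer;                     /* opaque GC-managed buffer */

    typedef union safe_pmc_data {
        struct {
            struct toy_buffer *buf;        /* pointer slot, pointee type fixed */
            long               len;        /* data slot */
        } as_string;
        struct {
            struct toy_buffer *buf;        /* same offset, same pointee type */
            long               ival;       /* data slot; different meaning is fine */
        } as_int;
    } safe_pmc_data;

A collector or another thread reading buf through either variant always finds a valid struct toy_buffer pointer (or NULL), no matter which morph ran; only the plain-data slots change meaning.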
And what of the per-PMC mutex? Is that not also considerable overhead? More than an unused field, even.
To keep internal state consistent, we have to LOCK shared PMCs; that's it. This locking is sometimes necessary for reading too.
Sometimes? Unless Parrot can prove a PMC is not shared, PMC locking is ALWAYS necessary for ALL accesses to ANY PMC. (This is easy to show: for any thread, hypothesize a preemption and a morph at an inconvenient PC. Some morph will always put an int where a pointer was expected, or change the type of a pointer.)
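Here is that hazard in a toy C sketch (hypothetical names, not Parrot's real API; assumes POSIX threads). Even a read-only accessor must take the per-PMC lock, because a morph on another thread can retype the union between the reader's type check and its pointer load.

    /* Hypothetical sketch; compile with: cc -pthread sketch.c */
    #include <pthread.h>
    #include <string.h>

    enum { TOY_STRING, TOY_INT };

    typedef struct toy_pmc {
        int type;                 /* current type tag */
        union {
            char *str_val;        /* valid only while type == TOY_STRING */
            long  int_val;        /* valid only while type == TOY_INT */
        } u;
        pthread_mutex_t lock;     /* the per-PMC mutex under discussion */
    } toy_pmc;

    /* morph: retype the PMC in place, reusing the union slot. */
    void toy_morph_to_int(toy_pmc *p, long v)
    {
        pthread_mutex_lock(&p->lock);
        p->type = TOY_INT;
        p->u.int_val = v;         /* an int now sits where a pointer was */
        pthread_mutex_unlock(&p->lock);
    }

    /* Even a pure reader must lock: without it, a morph can run between
     * the type check and the strlen(), and we would dereference
     * p->u.int_val as if it were a char*. */
    size_t toy_strlen(toy_pmc *p)
    {
        size_t n = 0;
        pthread_mutex_lock(&p->lock);
        if (p->type == TOY_STRING)
            n = strlen(p->u.str_val);
        pthread_mutex_unlock(&p->lock);
        return n;
    }

    int main(void)
    {
        toy_pmc p = { TOY_STRING, { .str_val = "hello" },
                      PTHREAD_MUTEX_INITIALIZER };
        toy_morph_to_int(&p, 42);
        return (int)toy_strlen(&p);   /* 0: the type check fails safely */
    }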
We seem to be painting ourselves into a corner here with impossible constraints. Clearly, adding size to PMCs is undesirable, and I recognize that you put a lot of work into shrinking the PMC struct relative to a Perl 5 scalar. But, fresh from CVS:
BENCHMARK                          USER     SYS  %CPU    TOTAL
------------------------------ ------- ------- ----- --------
addit.imc                       23.53s   0.05s   96%   24.517
addit2.imc                      15.97s   0.07s   98%   16.258
arriter.imc                      0.03s   0.03s   37%    0.159
arriter_o1.imc                   0.02s   0.04s   37%    0.160
fib.imc                          4.41s   3.07s   97%    7.691
addit.pasm                      17.84s   0.05s   95%   18.827
bench_newp.pasm                  4.53s   0.16s   97%    4.824
freeze.pasm                      1.47s   0.18s   82%    1.991
gc_alloc_new.pasm                0.20s   0.41s   72%    0.836
gc_alloc_reuse.pasm             19.09s  13.15s   98%   32.731
gc_generations.pasm             14.02s   5.42s   95%   20.424
gc_header_new.pasm               7.91s   1.37s   97%    9.558
gc_header_reuse.pasm            11.45s   0.09s   98%   11.733
gc_waves_headers.pasm            3.05s   0.40s   95%    3.597
gc_waves_sizeable_data.pasm      2.22s   3.81s   97%    6.200
gc_waves_sizeable_headers.pasm   7.48s   1.78s   97%    9.480
hash-utf8.pasm                   7.02s   0.07s   95%    7.404
primes.pasm                     61.63s   0.10s   97% 1:03.300
primes2.pasm                    39.26s   0.09s   97%   40.189
primes2_p.pasm                  69.00s   0.17s   96% 1:11.840
stress.pasm                      2.19s   0.28s   91%    2.705
stress1.pasm                    46.95s   1.33s   95%   50.377
stress2.pasm                     8.63s   0.24s   98%    9.026
stress3.pasm                    26.19s   0.78s   96%   28.063
TOTAL                                                 7:21.890
And with a PMC struct that's bloated by 16 bytes as proposed:
BENCHMARK                          USER     SYS  %CPU    TOTAL
------------------------------ ------- ------- ----- --------
addit.imc                       23.67s   0.10s   96%   24.745
addit2.imc                      16.05s   0.07s   98%   16.414
arriter.imc                      0.03s   0.03s   44%    0.135
arriter_o1.imc                   0.03s   0.04s   48%    0.144
fib.imc                          4.47s   2.99s   86%    8.651
addit.pasm                      18.08s   0.08s   98%   18.442
bench_newp.pasm                  4.73s   0.18s   97%    5.019
freeze.pasm                      1.56s   0.29s   89%    2.075
gc_alloc_new.pasm                0.25s   0.35s   89%    0.673
gc_alloc_reuse.pasm             18.90s  13.58s   96%   33.642
gc_generations.pasm             13.84s   5.80s   98%   19.942
gc_header_new.pasm               7.98s   1.29s   97%    9.492
gc_header_reuse.pasm            11.37s   0.05s   98%   11.552
gc_waves_headers.pasm            3.09s   0.33s   96%    3.538
gc_waves_sizeable_data.pasm      1.98s   4.01s   96%    6.207
gc_waves_sizeable_headers.pasm   7.67s   1.60s   91%   10.112
hash-utf8.pasm                   7.03s   0.05s   97%    7.287
primes.pasm                     61.68s   0.15s   98% 1:02.800
primes2.pasm                    39.17s   0.12s   99%   39.553
primes2_p.pasm                  69.01s   0.18s   96% 1:11.640
stress.pasm                      2.32s   0.43s   96%    2.840
stress1.pasm                    50.85s   1.79s   93%   56.189
stress2.pasm                     9.18s   0.32s   98%    9.689
stress3.pasm                    29.06s   1.23s   98%   30.748
TOTAL                                                 7:31.527
That's only 2.1% slower overall (7:21.890 vs. 7:31.527).
What do you think the overall performance effect of fine-grained locking will be? You just showed in a microbenchmark that it's 400% for some operations. We've also heard anecdotal evidence of 400% *overall* performance hits from similar threading strategies in other projects. And remember, these overheads come ON TOP OF the user's own synchronization: the per-PMC locks will rarely coincide with the high-level critical sections the user actually needs.
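For a feel of where that class of overhead comes from, here is a standalone sketch (my own, not the microbenchmark quoted above; assumes POSIX) timing an integer add bare versus wrapped in a mutex lock/unlock pair, which is roughly what a per-PMC lock around every vtable operation amounts to.

    /* bench.c: compare a bare add against a mutex-wrapped add.
     * Compile with: cc -O2 -pthread bench.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 10 * 1000 * 1000 };
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        volatile long acc = 0;            /* volatile: keep the loops */
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            acc += i;                     /* bare operation */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        for (long i = 0; i < N; i++) {
            pthread_mutex_lock(&m);       /* per-"PMC" lock ... */
            acc += i;
            pthread_mutex_unlock(&m);     /* ... around every operation */
        }
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("bare:   %.3fs\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        printf("locked: %.3fs\n",
               (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9);
        return 0;
    }

Even uncontended, the lock/unlock pair is many times the cost of the add itself, and that tax is paid on every PMC access whether or not the user's program ever shares the variable.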
If these are the two options, then as a user I would rather have a separate threaded parrot executable that takes the 2.1% hit than suffer the 400% overhead described above. It's easily the difference between usable threads and YAFATTP (yet another failed attempt to thread Perl).
--
Gordon Henriksen [EMAIL PROTECTED]