Re: Performance of parallel computing.
You're damn right here. First of all, I must admit that I've misinterpreted the benchmark results (guilty). Yet I think I know what's really happening here. To make things really clear I ran the benchmark for every number of workers from 1 to 9. Here is the cleaned-up output:

Timing 1 iterations of worker1, worker2, worker3, worker4, worker5, worker6, worker7, worker8, worker9...
 worker1: 22.125 wallclock secs (22.296 usr 0.248 sys 22.544 cpu) @ 0.045/s (n=1)
 worker2: 12.554 wallclock secs (24.221 usr 0.715 sys 24.936 cpu) @ 0.080/s (n=1)
 worker3:  9.330 wallclock secs (25.708 usr 1.316 sys 27.024 cpu) @ 0.107/s (n=1)
 worker4:  8.221 wallclock secs (28.151 usr 2.676 sys 30.827 cpu) @ 0.122/s (n=1)
 worker5:  7.131 wallclock secs (30.395 usr 3.658 sys 34.053 cpu) @ 0.140/s (n=1)
 worker6:  7.180 wallclock secs (34.496 usr 4.479 sys 38.975 cpu) @ 0.139/s (n=1)
 worker7:  7.050 wallclock secs (38.267 usr 5.453 sys 43.720 cpu) @ 0.142/s (n=1)
 worker8:  6.668 wallclock secs (41.607 usr 5.586 sys 47.194 cpu) @ 0.150/s (n=1)
 worker9:  7.220 wallclock secs (46.762 usr 11.647 sys 58.409 cpu) @ 0.139/s (n=1)

             s/iter   worker1
 worker1   22125229       --
 worker2   12554094      76%
 worker3    9329865     137%
 worker4    8221486     169%
 worker5    7130758     210%
 worker6    7180343     208%
 worker7    7049935     214%
 worker8    6667794     232%
 worker9    7219864     206%

The plateau is there, but it's reached even before we run out of all the available cores: 5 workers already take all of the CPU power. Yet the speedup achieved is much less than I'd expected... But then I realized that there is another player on the field: thermal throttling. And that actually makes any further measurements on my notebook useless.

This is also an answer to Parrot's suggestion about possible cache involvement: that's not it, for sure. Especially if we take into account that the numbers were +/- the same on every benchmark run.
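For reference, the percentage column in the table above is just the pairwise ratio of per-iteration times. A quick sanity-check sketch (Python, purely for illustration) reproduces the worker1 column from the wall-clock figures:

```python
# Wall-clock seconds per run, copied from the benchmark output above.
times = {1: 22.125, 2: 12.554, 3: 9.330, 4: 8.221, 5: 7.131,
         6: 7.180, 7: 7.050, 8: 6.668, 9: 7.220}

# cmpthese-style percentage: how much faster N workers are than 1 worker,
# i.e. (time_1 / time_N - 1) * 100, rounded to a whole percent.
speedup = {n: round((times[1] / t - 1) * 100) for n, t in times.items()}

for n in sorted(speedup):
    print(f"worker{n}: {speedup[n]:+d}% vs worker1")
```

The computed values (76%, 137%, ..., 232%, 206%) match the table, confirming the percentages are relative wall-clock ratios, not CPU-time ratios.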
> On Dec 7, 2018, at 12:04, yary wrote:
>
> OK... going back to the hypothesis in the OP
>
>> The plateau is seemingly defined by the number of cores or, more correctly,
>> by the number of supported threads.
>
> This suggests that the benchmark is CPU-bound, which is supported by
> your more recent observation "100% load for a single one"
>
> Also, you mentioned running MacOS with two threads per core, which
> implies Intel's hyperthreading. Depending on the workload, CPU-bound
> processes sharing a hyperthreaded core see a speedup of 0-30%, as
> opposed to running on separate cores which can give a speedup of 100%.
> (Back when I searched for large primes, HT gave a 25% speed boost.) So
> with 6 cores, 2 HT per core, I would expect a max parallel boost of
> 6 * (1x + 0.30x) = 7.8x - and your test is only giving half that.
>
> -y

Best regards,
Vadim Belman
Re: Performance of parallel computing.
OK... going back to the hypothesis in the OP

> The plateau is seemingly defined by the number of cores or, more correctly,
> by the number of supported threads.

This suggests that the benchmark is CPU-bound, which is supported by your more recent observation "100% load for a single one".

Also, you mentioned running MacOS with two threads per core, which implies Intel's hyperthreading. Depending on the workload, CPU-bound processes sharing a hyperthreaded core see a speedup of 0-30%, as opposed to running on separate cores, which can give a speedup of 100%. (Back when I searched for large primes, HT gave a 25% speed boost.) So with 6 cores, 2 HT per core, I would expect a max parallel boost of 6 * (1x + 0.30x) = 7.8x - and your test is only giving half that.

-y
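yary's ceiling estimate can be written out explicitly. A small sketch (Python, purely illustrative) of the hyperthreading model he describes, where the second hardware thread on a core contributes only a fraction of a full core:

```python
def max_parallel_speedup(cores: int, ht_benefit: float) -> float:
    """Upper bound on CPU-bound speedup: each physical core contributes
    1x, and its second hyperthread adds only a fractional ht_benefit."""
    return cores * (1.0 + ht_benefit)

# 6 physical cores, second hardware thread worth ~30% of a core:
bound = max_parallel_speedup(6, 0.30)
print(f"{bound:.1f}x")  # prints "7.8x"
```

With a 25% HT benefit (yary's large-primes figure) the same model gives 7.5x, so the 0.30 factor is an optimistic upper edge, not a measurement.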
Re: Performance of parallel computing.
Is it possible that OS caching is having an effect on the performance? It's sometimes necessary to run the same code several times before it settles down to consistent results.

On 12/7/18, Vadim Belman wrote:
> There is no need to fill the channel prior to starting the workers.
> First of all, 100 repetitions of SHA256 per worker takes ~0.7 sec on my
> system. [...]
>
> What's more important is the observation of CPU consumption by the moar
> process. Depending on the number of workers I was getting numbers from
> 100% load for a single one up to 1000% for the whole bunch of 15. This
> perfectly corresponds with the 6 cores/2 threads per core of my CPU.
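A cheap way to check for that kind of warm-up effect is to time several consecutive runs of the identical code and watch for drift. A sketch (Python for illustration, with a hypothetical stand-in SHA-256 workload rather than the original Raku benchmark):

```python
import hashlib
import time

def work() -> None:
    # Stand-in workload (hypothetical): repeated SHA-256 hashing,
    # roughly in the spirit of the benchmark under discussion.
    digest = b"seed"
    for _ in range(10_000):
        digest = hashlib.sha256(digest).digest()

# Run the identical code several times; if OS/CPU caches or VM warm-up
# matter, the first timings will sit noticeably above the later ones.
timings = []
for i in range(5):
    start = time.perf_counter()
    work()
    timings.append(time.perf_counter() - start)
    print(f"run {i + 1}: {timings[-1]:.4f} s")
```

If the timings stay +/- flat across repetitions, as Vadim reports for his runs, caching is unlikely to be the explanation.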
Re: Performance of parallel computing.
There is no need to fill the channel prior to starting the workers. First of all, 100 repetitions of SHA256 per worker takes ~0.7 sec on my system. I didn't benchmark the generator thread, but considering that even your timing gives 0.054 sec per string, it will most definitely remain fast enough to provide all workers with data. But even with this in mind I re-ran the test with only 100-character strings being generated. Here is what I've got:

Benchmark:
Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
 workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
           (warning: too few iterations for a reliable count)
workers10:  6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
           (warning: too few iterations for a reliable count)
workers15:  6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
           (warning: too few iterations for a reliable count)
 workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
           (warning: too few iterations for a reliable count)
 workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
           (warning: too few iterations for a reliable count)
 workers5:  7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
           (warning: too few iterations for a reliable count)

              s/iter  workers3  workers15  workers5  workers10  workers2  workers1
 workers3   10553022        --      -42%      -28%       -42%       34%      113%
workers15    6165235       71%        --       24%        -0%      129%      265%
 workers5    7650413       38%      -19%        --       -20%       84%      194%
workers10    6154300       71%        0%       24%         --      129%      265%
 workers2   14101512      -25%      -56%      -46%       -56%        --       59%
 workers1   22473185      -53%      -73%      -66%       -73%      -37%        --

What's more important is the observation of CPU consumption by the moar process.
Depending on the number of workers I was getting numbers from 100% load for a single one up to 1000% for the whole bunch of 15. This perfectly corresponds with the 6 cores/2 threads per core of my CPU.

> On Dec 7, 2018, at 02:06, yary wrote:
>
> That was a bit vague - I meant that I suspect the workers are being
> starved, since you have many consumers and only a single thread
> generating the 1k strings. I would prime the channel to be full - or
> otherwise restructure to ensure all threads are kept busy.
>
> -y
>
> On Thu, Dec 6, 2018 at 10:56 PM yary wrote:
>>
>> Not sure if your test is measuring what you expect - the setup of
>> generating 50 x 1k strings is taking 2.7 sec on my laptop, and that's
>> reducing the apparent effect of parallelism.
>>
>> $ perl6
>> To exit type 'exit' or '^D'
>> > my $c = Channel.new;
>> Channel.new
>> > { for 1..50 { $c.send( (1..1024).map( { (' '..'Z').pick } ).join ) }; say now - ENTER now; }
>> 2.7289092
>>
>> I'd move the setup outside the "cmpthese" and try again, re-think the
>> new results.
>>
>> On 12/6/18, Vadim Belman wrote:
>>> Hi everybody!
>>>
>>> I have recently played a bit with somewhat intense computations and
>>> tried to parallelize them among a couple of threaded workers. The
>>> results were somewhat... eh... discouraging.
>>> To sum up my findings I wrote a simple demo benchmark:
>>>
>>> use Digest::SHA;
>>> use Bench;
>>>
>>> sub worker ( Str:D $str ) {
>>>     my $digest = $str;
>>>
>>>     for 1..100 {
>>>         $digest = sha256 $digest;
>>>     }
>>> }
>>>
>>> sub run ( Int $workers ) {
>>>     my $c = Channel.new;
>>>
>>>     my @w;
>>>     @w.push: start {
>>>         for 1..50 {
>>>             $c.send(
>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>             );
>>>         }
>>>         LEAVE $c.close;
>>>     }
>>>
>>>     for 1..$workers {
>>>         @w.push: start {
>>>             react {
>>>                 whenever $c -> $str {
>>>                     worker( $str );
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>>     await @w;
>>> }
>>>
>>> my $b = Bench.new;
>>> $b.cmpthese(
>>>     1,
>>>     {
>>>         workers1  => sub { run( 1 ) },
>>>         workers5  => sub { run( 5 ) },
>>>         workers10 => sub { run( 10 ) },
>>>         workers15 => sub { run( 15 ) },
>>>     }
>>> );
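Whether a single generator thread can keep the pool fed follows from the two per-string figures quoted in this thread. A back-of-the-envelope sketch (Python for illustration; the inputs are the posted 1 KiB-string timings, so this is conservative for the 100-character re-run):

```python
# Figures quoted in this thread (assumptions taken from the posted timings):
# - yary: generating 50 x 1 KiB strings took ~2.73 s  -> ~0.055 s/string
# - workers1 run: 50 strings fully hashed in ~22.47 s -> ~0.45 s/string
gen_per_string = 2.7289 / 50
hash_per_string = 22.473 / 50

# A single producer keeps the pool fed as long as it generates strings
# faster than the workers jointly consume them; past this worker count
# the producer itself becomes the bottleneck.
break_even = hash_per_string / gen_per_string
print(f"producer keeps up with about {break_even:.0f} workers")
```

Under those assumed figures the lone producer only becomes the bottleneck somewhere past eight workers, which is consistent with Vadim's point that generation is fast enough to feed the pool in the measured range.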