Is it possible that OS caching is having an effect on the performance? It's sometimes necessary to run the same code several times before it settles down to consistent results.
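For example, one untested way to check that is simply to raise the iteration count passed to Bench's cmpthese (the script quoted below passes 1), so each variant is timed several times and a single cold run can't dominate. This reuses the run() sub from the benchmark further down; the count of 5 is an arbitrary choice:

my $b = Bench.new;
# Time each variant 5 times so warm-up and caching effects are
# averaged out instead of dominating a single measurement.
$b.cmpthese(
    5,
    {
        workers1  => sub { run( 1 ) },
        workers5  => sub { run( 5 ) },
        workers10 => sub { run( 10 ) },
        workers15 => sub { run( 15 ) },
    }
);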
On 12/7/18, Vadim Belman <[email protected]> wrote:
> There is no need for filling in the channel prior to starting workers.
> First of all, 100 repetitions of SHA256 per worker take ~0.7sec on my
> system. I didn't do benchmarking of the generator thread, but considering
> that even your timing gives 0.054 sec per string, it will most definitely
> remain fast enough to provide all workers with data. But even with this in
> mind I re-ran the test with only 100-character strings being generated.
> Here is what I've got:
>
> Benchmark:
> Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
> workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
>     (warning: too few iterations for a reliable count)
>
> O-----------O----------O----------O-----------O----------O-----------O----------O----------O
> |           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
> O===========O==========O==========O===========O==========O===========O==========O==========O
> | workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
> | workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
> | workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
> | workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
> | workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
> | workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
> --------------------------------------------------------------------------------------------
>
> What's more important is the observation of the CPU consumption by the moar
> process. Depending on the number of workers I was getting numbers from 100%
> load for a single one up to 1000% for the whole bunch of 15. This perfectly
> corresponds with the 6 cores / 2 threads per core of my CPU.
>
>> On Dec 7, 2018, at 02:06, yary <[email protected]> wrote:
>>
>> That was a bit vague - I meant that I suspect the workers are being
>> starved, since you have many consumers, and only a single thread
>> generating the 1k strings. I would prime the channel to be full - or
>> do some other restructuring to ensure all threads are kept busy.
>>
>> -y
>>
>> On Thu, Dec 6, 2018 at 10:56 PM yary <[email protected]> wrote:
>>>
>>> Not sure if your test is measuring what you expect - the setup of
>>> generating 50 x 1k strings is taking 2.7sec on my laptop, and that's
>>> reducing the apparent effect of parallelism.
>>>
>>> $ perl6
>>> To exit type 'exit' or '^D'
>>>> my $c = Channel.new;
>>> Channel.new
>>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now - ENTER now; }
>>> 2.7289092
>>>
>>> I'd move the setup outside the "cmpthese" and try again, re-think the
>>> new results.
>>>
>>> On 12/6/18, Vadim Belman <[email protected]> wrote:
>>>> Hi everybody!
>>>>
>>>> I have recently played a bit with somewhat intense computations and tried to
>>>> parallelize them among a couple of threaded workers. The results were
>>>> somewhat... eh... discouraging. To sum up my findings I wrote a simple demo
>>>> benchmark:
>>>>
>>>> use Digest::SHA;
>>>> use Bench;
>>>>
>>>> sub worker ( Str:D $str ) {
>>>>     my $digest = $str;
>>>>
>>>>     for 1..100 {
>>>>         $digest = sha256 $digest;
>>>>     }
>>>> }
>>>>
>>>> sub run ( Int $workers ) {
>>>>     my $c = Channel.new;
>>>>
>>>>     my @w;
>>>>     @w.push: start {
>>>>         for 1..50 {
>>>>             $c.send(
>>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>>             );
>>>>         }
>>>>         LEAVE $c.close;
>>>>     }
>>>>
>>>>     for 1..$workers {
>>>>         @w.push: start {
>>>>             react {
>>>>                 whenever $c -> $str {
>>>>                     worker( $str );
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     await @w;
>>>> }
>>>>
>>>> my $b = Bench.new;
>>>> $b.cmpthese(
>>>>     1,
>>>>     {
>>>>         workers1  => sub { run( 1 ) },
>>>>         workers5  => sub { run( 5 ) },
>>>>         workers10 => sub { run( 10 ) },
>>>>         workers15 => sub { run( 15 ) },
>>>>     }
>>>> );
>>>>
>>>> I tried this code with a macOS installation of Rakudo and with a Linux in a
>>>> VM box. Here is macOS results (6 CPU cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>> workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>>
>>>> O-----------O----------O----------O-----------O-----------O----------O
>>>> |           | s/iter   | workers1 | workers10 | workers15 | workers5 |
>>>> O===========O==========O==========O===========O===========O==========O
>>>> | workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
>>>> | workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
>>>> | workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
>>>> | workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
>>>> ----------------------------------------------------------------------
>>>>
>>>> And Linux (4 virtual cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>> workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>>
>>>> O-----------O----------O----------O----------O-----------O-----------O
>>>> |           | s/iter   | workers5 | workers1 | workers15 | workers10 |
>>>> O===========O==========O==========O==========O===========O===========O
>>>> | workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
>>>> | workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
>>>> | workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
>>>> | workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
>>>> ----------------------------------------------------------------------
>>>>
>>>> Am I missing something here? Am I doing something wrong? Because it just
>>>> doesn't fit into my mind...
>>>>
>>>> As a side note: by playing with 1-2-3 workers I see that each new thread
>>>> gradually adds on top of the total run time until a plateau is reached. The
>>>> plateau is seemingly defined by the number of cores or, more correctly, by
>>>> the number of supported threads. Proving this hypothesis would require more
>>>> time than I have on my hands right now. And I'm not even sure if such a
>>>> proof makes sense.
>>>>
>>>> Best regards,
>>>> Vadim Belman
>>>
>>> --
>>> -y
>>
>
> Best regards,
> Vadim Belman
>
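P.S. For concreteness, here is a rough, untested sketch of the restructuring suggested up-thread: the test strings are generated once, outside anything cmpthese times, and the channel is primed (filled and closed) before any worker starts, so no consumer can be starved by the generator. Names follow the original benchmark; treat it as an illustration, not a drop-in replacement.

use Digest::SHA;
use Bench;

sub worker ( Str:D $str ) {
    my $digest = $str;
    for 1..100 {
        $digest = sha256 $digest;
    }
}

# Generate the test data once, outside the timed region.
my @strings = (1..50).map: { (1..1024).map( { (' '..'Z').pick } ).join };

sub run ( Int $workers ) {
    # Prime the channel with every pre-generated string and close it
    # up front, so workers only ever compete for ready-made data.
    my $c = Channel.new;
    $c.send($_) for @strings;
    $c.close;

    my @w;
    for 1..$workers {
        @w.push: start {
            react {
                whenever $c -> $str {
                    worker( $str );
                }
            }
        }
    }

    await @w;
}

my $b = Bench.new;
$b.cmpthese(
    1,
    {
        workers1  => sub { run( 1 ) },
        workers5  => sub { run( 5 ) },
        workers10 => sub { run( 10 ) },
        workers15 => sub { run( 15 ) },
    }
);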
