Is it possible that OS caching is having an effect on the performance?
It's sometimes necessary to run the same code several times before it
settles down to consistent results.
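
One cheap way to reduce that noise, assuming Bench's cmpthese() takes the
iteration count as its first argument (as it is used in the script further
down this thread), is to bump that count above 1 so each variant is timed
repeatedly:

    my $b = Bench.new;
    $b.cmpthese(
        5,    # time each variant 5 times instead of once
        {
            workers1  => sub { run( 1 ) },
            workers5  => sub { run( 5 ) },
            workers10 => sub { run( 10 ) },
        }
    );

That should also quiet the "too few iterations for a reliable count"
warnings shown in the reports below.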

On 12/7/18, Vadim Belman <[email protected]> wrote:
> There is no need to fill the channel prior to starting the workers.
> First of all, 100 repetitions of SHA256 per worker take ~0.7sec on my
> system. I didn't benchmark the generator thread, but considering that
> even your timing gives 0.054sec per string, it will most definitely
> remain fast enough to provide all workers with data. But even with this
> in mind I re-ran the test with strings only 100 characters long being
> generated. Here is what I've got:
>
> Benchmark:
> Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
>   workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
>               (warning: too few iterations for a reliable count)
>  workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
>               (warning: too few iterations for a reliable count)
>  workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
>               (warning: too few iterations for a reliable count)
>   workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
>               (warning: too few iterations for a reliable count)
>   workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
>               (warning: too few iterations for a reliable count)
>   workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
>               (warning: too few iterations for a reliable count)
> O-----------O----------O----------O-----------O----------O-----------O----------O----------O
> |           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
> O===========O==========O==========O===========O==========O===========O==========O==========O
> | workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
> | workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
> | workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
> | workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
> | workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
> | workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
> --------------------------------------------------------------------------------------------
>
> What's more important is the observed CPU consumption of the moar
> process. Depending on the number of workers I was getting anywhere from
> 100% load for a single worker up to 1000% for the whole bunch of 15.
> This perfectly corresponds with the 6 cores / 2 threads per core of my CPU.
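>
> A quick sanity check of that plateau, assuming a recent enough Rakudo
> where Kernel provides a cpu-cores method, is to ask the runtime how many
> logical CPUs it sees:
>
>     # Logical CPUs visible to the VM (cores x threads per core)
>     say $*KERNEL.cpu-cores;    # 12 expected here: 6 cores x 2 threads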
>
>> On Dec 7, 2018, at 02:06, yary <[email protected]> wrote:
>>
>> That was a bit vague - I meant that I suspect the workers are being
>> starved, since you have many consumers and only a single thread
>> generating the 1k strings. I would prime the channel so it is full, or
>> do some other restructuring to ensure all threads are kept busy.
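>>
>> A possible restructuring - just an untested sketch reusing the worker()
>> sub from the original script, with run-primed as a placeholder name -
>> could generate everything up front, close the channel, and only then
>> start the consumers:
>>
>>     sub run-primed ( Int $workers ) {
>>         my $c = Channel.new;
>>
>>         # Fill the channel completely before any worker starts, so no
>>         # consumer ever waits on the producer.
>>         for 1..50 {
>>             $c.send( (1..1024).map( { (' '..'Z').pick } ).join );
>>         }
>>         $c.close;
>>
>>         my @w = (1..$workers).map: {
>>             start {
>>                 react {
>>                     whenever $c -> $str {
>>                         worker( $str );
>>                     }
>>                 }
>>             }
>>         };
>>
>>         await @w;
>>     }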
>>
>> -y
>>
>> On Thu, Dec 6, 2018 at 10:56 PM yary <[email protected]> wrote:
>>>
>>> Not sure if your test is measuring what you expect - the setup of
>>> generating 50 x 1k strings takes 2.7sec on my laptop, and that's
>>> reducing the apparent effect of parallelism.
>>>
>>> $ perl6
>>> To exit type 'exit' or '^D'
>>>> my $c = Channel.new;
>>> Channel.new
>>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now - ENTER now; }
>>> 2.7289092
>>>
>>> I'd move the setup outside the "cmpthese", try again, and re-evaluate
>>> the new results.
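>>>
>>> Roughly like this - an untested sketch which assumes run() is changed
>>> to accept the pre-built strings instead of generating them itself:
>>>
>>>     # Generate the test data once, outside the timed code
>>>     my @strings = (1..50).map: {
>>>         (1..1024).map( { (' '..'Z').pick } ).join
>>>     };
>>>
>>>     my $b = Bench.new;
>>>     $b.cmpthese(
>>>         1,
>>>         {
>>>             workers1  => sub { run( 1,  @strings ) },
>>>             workers5  => sub { run( 5,  @strings ) },
>>>             workers10 => sub { run( 10, @strings ) },
>>>         }
>>>     );
>>>
>>> Inside the modified run() the producer thread would then simply do
>>> $c.send($_) for @strings; before closing the channel.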
>>>
>>>
>>>
>>> On 12/6/18, Vadim Belman <[email protected]> wrote:
>>>> Hi everybody!
>>>>
>>>> I have recently played a bit with somewhat intense computations and
>>>> tried to parallelize them among a couple of threaded workers. The
>>>> results were somewhat... eh... discouraging. To sum up my findings I
>>>> wrote a simple demo benchmark:
>>>>
>>>>     use Digest::SHA;
>>>>     use Bench;
>>>>
>>>>     sub worker ( Str:D $str ) {
>>>>         my $digest = $str;
>>>>
>>>>         for 1..100 {
>>>>             $digest = sha256 $digest;
>>>>         }
>>>>     }
>>>>
>>>>     sub run ( Int $workers ) {
>>>>         my $c = Channel.new;
>>>>
>>>>         my @w;
>>>>         @w.push: start {
>>>>             for 1..50 {
>>>>                 $c.send(
>>>>                     (1..1024).map( { (' '..'Z').pick } ).join
>>>>                 );
>>>>             }
>>>>             LEAVE $c.close;
>>>>         }
>>>>
>>>>         for 1..$workers {
>>>>             @w.push: start {
>>>>                 react {
>>>>                     whenever $c -> $str {
>>>>                         worker( $str );
>>>>                     }
>>>>                 }
>>>>             }
>>>>         }
>>>>
>>>>         await @w;
>>>>     }
>>>>
>>>>     my $b = Bench.new;
>>>>     $b.cmpthese(
>>>>         1,
>>>>         {
>>>>             workers1 => sub { run( 1 ) },
>>>>             workers5 => sub { run( 5 ) },
>>>>             workers10 => sub { run( 10 ) },
>>>>             workers15 => sub { run( 15 ) },
>>>>         }
>>>>     );
>>>>
>>>> I tried this code with a macOS installation of Rakudo and with a Linux
>>>> in a
>>>> VM box. Here is macOS results (6 CPU cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>>  workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>>  workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> O-----------O----------O----------O-----------O-----------O----------O
>>>> |           | s/iter   | workers1 | workers10 | workers15 | workers5 |
>>>> O===========O==========O==========O===========O===========O==========O
>>>> | workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
>>>> | workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
>>>> | workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
>>>> | workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
>>>> ----------------------------------------------------------------------
>>>>
>>>> And Linux (4 virtual cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>>  workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>>  workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
>>>>              (warning: too few iterations for a reliable count)
>>>> O-----------O----------O----------O----------O-----------O-----------O
>>>> |           | s/iter   | workers5 | workers1 | workers15 | workers10 |
>>>> O===========O==========O==========O==========O===========O===========O
>>>> | workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
>>>> | workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
>>>> | workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
>>>> | workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
>>>> ----------------------------------------------------------------------
>>>>
>>>> Am I missing something here? Am I doing something wrong? Because it
>>>> just doesn't fit into my mind...
>>>>
>>>> As a side note: by playing with 1-2-3 workers I see that each
>>>> additional thread gradually improves the total run time until a
>>>> plateau is reached. The plateau is seemingly defined by the number of
>>>> cores or, more correctly, by the number of supported hardware
>>>> threads. Proving this hypothesis would require more time than I have
>>>> on my hands right now. And I'm not even sure such a proof makes
>>>> sense.
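>>>>
>>>> One way to test it, assuming ThreadPoolScheduler.new still accepts a
>>>> max_threads argument, would be to cap the thread pool and watch where
>>>> the plateau moves:
>>>>
>>>>     # Hypothetical experiment: limit the scheduler to 4 OS threads
>>>>     {
>>>>         my $*SCHEDULER = ThreadPoolScheduler.new( max_threads => 4 );
>>>>         run( 15 );    # 15 workers, but at most 4 pool threads
>>>>     }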
>>>>
>>>> Best regards,
>>>> Vadim Belman
>>>>
>>>>
>>>
>>>
>>> --
>>> -y
>>
>
> Best regards,
> Vadim Belman
>
>
