Is it possible that OS caching is having an effect on the performance? It's sometimes necessary to run the same code several times before it settles down to consistent results.
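For example, one untested way to check that is simply to raise the iteration count passed to Bench's cmpthese (the script quoted below passes 1), so each variant is timed several times and a single cold run can't dominate. This reuses the run() sub from the benchmark further down; the count of 5 is an arbitrary choice:

my $b = Bench.new;
# Time each variant 5 times so warm-up and caching effects are
# averaged out instead of dominating a single measurement.
$b.cmpthese(
    5,
    {
        workers1  => sub { run( 1 ) },
        workers5  => sub { run( 5 ) },
        workers10 => sub { run( 10 ) },
        workers15 => sub { run( 15 ) },
    }
);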
On 12/7/18, Vadim Belman <[email protected]> wrote:
> There is no need for filling in the channel prior to starting workers.
> First of all, 100 repetitions of SHA256 per worker take ~0.7sec on my
> system. I didn't do benchmarking of the generator thread, but considering
> that even your timing gives 0.054 sec per string, it will most definitely
> remain fast enough to provide all workers with data. But even with this in
> mind I re-ran the test with only 100-character strings being generated.
> Here is what I've got:
>
> Benchmark:
> Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
> workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
>     (warning: too few iterations for a reliable count)
> workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
>     (warning: too few iterations for a reliable count)
>
> O-----------O----------O----------O-----------O----------O-----------O----------O----------O
> |           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
> O===========O==========O==========O===========O==========O===========O==========O==========O
> | workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
> | workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
> | workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
> | workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
> | workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
> | workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
> --------------------------------------------------------------------------------------------
>
> What's more important is the observation of the CPU consumption by the moar
> process. Depending on the number of workers I was getting numbers from 100%
> load for a single one up to 1000% for the whole bunch of 15. This perfectly
> corresponds with the 6 cores / 2 threads per core of my CPU.
>
>> On Dec 7, 2018, at 02:06, yary <[email protected]> wrote:
>>
>> That was a bit vague - I meant that I suspect the workers are being
>> starved, since you have many consumers, and only a single thread
>> generating the 1k strings. I would prime the channel to be full - or
>> do some other restructuring to ensure all threads are kept busy.
>>
>> -y
>>
>> On Thu, Dec 6, 2018 at 10:56 PM yary <[email protected]> wrote:
>>>
>>> Not sure if your test is measuring what you expect - the setup of
>>> generating 50 x 1k strings is taking 2.7sec on my laptop, and that's
>>> reducing the apparent effect of parallelism.
>>>
>>> $ perl6
>>> To exit type 'exit' or '^D'
>>>> my $c = Channel.new;
>>> Channel.new
>>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now - ENTER now; }
>>> 2.7289092
>>>
>>> I'd move the setup outside the "cmpthese" and try again, re-think the
>>> new results.
>>>
>>> On 12/6/18, Vadim Belman <[email protected]> wrote:
>>>> Hi everybody!
>>>>
>>>> I have recently played a bit with somewhat intense computations and tried to
>>>> parallelize them among a couple of threaded workers. The results were
>>>> somewhat... eh... discouraging. To sum up my findings I wrote a simple demo
>>>> benchmark:
>>>>
>>>> use Digest::SHA;
>>>> use Bench;
>>>>
>>>> sub worker ( Str:D $str ) {
>>>>     my $digest = $str;
>>>>
>>>>     for 1..100 {
>>>>         $digest = sha256 $digest;
>>>>     }
>>>> }
>>>>
>>>> sub run ( Int $workers ) {
>>>>     my $c = Channel.new;
>>>>
>>>>     my @w;
>>>>     @w.push: start {
>>>>         for 1..50 {
>>>>             $c.send(
>>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>>             );
>>>>         }
>>>>         LEAVE $c.close;
>>>>     }
>>>>
>>>>     for 1..$workers {
>>>>         @w.push: start {
>>>>             react {
>>>>                 whenever $c -> $str {
>>>>                     worker( $str );
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     await @w;
>>>> }
>>>>
>>>> my $b = Bench.new;
>>>> $b.cmpthese(
>>>>     1,
>>>>     {
>>>>         workers1  => sub { run( 1 ) },
>>>>         workers5  => sub { run( 5 ) },
>>>>         workers10 => sub { run( 10 ) },
>>>>         workers15 => sub { run( 15 ) },
>>>>     }
>>>> );
>>>>
>>>> I tried this code with a macOS installation of Rakudo and with a Linux in a
>>>> VM box. Here is macOS results (6 CPU cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>> workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>>
>>>> O-----------O----------O----------O-----------O-----------O----------O
>>>> |           | s/iter   | workers1 | workers10 | workers15 | workers5 |
>>>> O===========O==========O==========O===========O===========O==========O
>>>> | workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
>>>> | workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
>>>> | workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
>>>> | workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
>>>> ----------------------------------------------------------------------
>>>>
>>>> And Linux (4 virtual cores):
>>>>
>>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>>> workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>> workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
>>>>     (warning: too few iterations for a reliable count)
>>>>
>>>> O-----------O----------O----------O----------O-----------O-----------O
>>>> |           | s/iter   | workers5 | workers1 | workers15 | workers10 |
>>>> O===========O==========O==========O==========O===========O===========O
>>>> | workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
>>>> | workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
>>>> | workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
>>>> | workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
>>>> ----------------------------------------------------------------------
>>>>
>>>> Am I missing something here? Am I doing something wrong? Because it just
>>>> doesn't fit into my mind...
>>>>
>>>> As a side note: by playing with 1-2-3 workers I see that each new thread
>>>> gradually adds on top of the total run time until a plateau is reached. The
>>>> plateau is seemingly defined by the number of cores or, more correctly, by
>>>> the number of supported threads. Proving this hypothesis would require more
>>>> time than I have on my hands right now. And I'm not even sure if such a
>>>> proof makes sense.
>>>>
>>>> Best regards,
>>>> Vadim Belman
>>>
>>> --
>>> -y
>>
>
> Best regards,
> Vadim Belman
>
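P.S. For concreteness, here is a rough, untested sketch of the restructuring suggested up-thread: the test strings are generated once, outside anything cmpthese times, and the channel is primed (filled and closed) before any worker starts, so no consumer can be starved by the generator. Names follow the original benchmark; treat it as an illustration, not a drop-in replacement.

use Digest::SHA;
use Bench;

sub worker ( Str:D $str ) {
    my $digest = $str;
    for 1..100 {
        $digest = sha256 $digest;
    }
}

# Generate the test data once, outside the timed region.
my @strings = (1..50).map: { (1..1024).map( { (' '..'Z').pick } ).join };

sub run ( Int $workers ) {
    # Prime the channel with every pre-generated string and close it
    # up front, so workers only ever compete for ready-made data.
    my $c = Channel.new;
    $c.send($_) for @strings;
    $c.close;

    my @w;
    for 1..$workers {
        @w.push: start {
            react {
                whenever $c -> $str {
                    worker( $str );
                }
            }
        }
    }

    await @w;
}

my $b = Bench.new;
$b.cmpthese(
    1,
    {
        workers1  => sub { run( 1 ) },
        workers5  => sub { run( 5 ) },
        workers10 => sub { run( 10 ) },
        workers15 => sub { run( 15 ) },
    }
);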
