Re: Performance of parallel computing.

2018-12-07 Thread Vadim Belman
You're damn right here. First of all, I must admit that I've misinterpreted the 
benchmark results (guilty). Yet, anyway, I think I know what's really happening 
here. To make things really clear I ran the benchmark for every number of 
workers from 1 to 9. Here is the cleaned-up output:

 Timing 1 iterations of worker1, worker2, worker3, worker4, worker5, worker6, worker7, worker8, worker9...
worker1: 22.125 wallclock secs (22.296 usr 0.248 sys 22.544 cpu) @ 0.045/s (n=1)
worker2: 12.554 wallclock secs (24.221 usr 0.715 sys 24.936 cpu) @ 0.080/s (n=1)
worker3: 9.330 wallclock secs (25.708 usr 1.316 sys 27.024 cpu) @ 0.107/s (n=1)
worker4: 8.221 wallclock secs (28.151 usr 2.676 sys 30.827 cpu) @ 0.122/s (n=1)
worker5: 7.131 wallclock secs (30.395 usr 3.658 sys 34.053 cpu) @ 0.140/s (n=1)
worker6: 7.180 wallclock secs (34.496 usr 4.479 sys 38.975 cpu) @ 0.139/s (n=1)
worker7: 7.050 wallclock secs (38.267 usr 5.453 sys 43.720 cpu) @ 0.142/s (n=1)
worker8: 6.668 wallclock secs (41.607 usr 5.586 sys 47.194 cpu) @ 0.150/s (n=1)
worker9: 7.220 wallclock secs (46.762 usr 11.647 sys 58.409 cpu) @ 0.139/s (n=1)
 O-----------O-----------O---------O
 |           | s/iter    | worker1 |
 O===========O===========O=========O
 | worker1   | 22.125229 | --      |
 | worker2   | 12.554094 | 76%     |
 | worker3   | 9.329865  | 137%    |
 | worker4   | 8.221486  | 169%    |
 | worker5   | 7.130758  | 210%    |
 | worker6   | 7.180343  | 208%    |
 | worker7   | 7.049935  | 214%    |
 | worker8   | 6.667794  | 232%    |
 | worker9   | 7.219864  | 206%    |
 

The plateau is there, but it's reached even before we run out of all the 
available cores: 5 workers already take all of the CPU power. Yet the speedup 
achieved is really much less than I'd have expected... But then I realized that 
there is another player on the field: thermal throttling. And that actually 
makes any further measurements on my notebook useless.
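
A quick way to see the throttling (a minimal sketch, not code from the 
benchmark; it assumes the same Digest::SHA module): time one fixed workload 
several times back to back. On a throttled CPU the later runs come out 
measurably slower than the first.

    use Digest::SHA;

    # One fixed chunk of work, identical on every run.
    sub fixed-load {
        my $digest = 'x' x 1024;
        $digest = sha256 $digest for 1..500;
    }

    for 1..5 -> $run {
        my $t0 = now;
        fixed-load;
        say "run $run: { now - $t0 } sec";
    }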

This is also an answer to Parrot's suggestion about possible cache 
involvement: that's not it, for sure. Especially if we take into account that 
the numbers were more or less the same on every benchmark run.

> On Dec 7, 2018, at 12:04, yary  wrote:
> 
> OK... going back to the hypothesis in the OP
> 
>> The plateau is seemingly defined by the number of cores or, more correctly, 
>> by the number of supported threads.
> 
> This suggests that the benchmark is CPU-bound, which is supported by
> your more recent observation "100% load for a single one".
> 
> Also, you mentioned running MacOS with two threads per core, which
> implies Intel's hyperthreading. Depending on the workload, CPU-bound
> processes sharing a hyperthreaded core see a speedup of 0-30%, as
> opposed to running on separate cores which can give a speedup of 100%.
> (Back when I searched for large primes, HT gave a 25% speed boost.) So
> with 6 cores, 2 HT per core, I would expect a max parallel boost of 6
> * (1x +0.30x) = 7.8x - and your test is only giving half that.
> 
> -y
> 

Best regards,
Vadim Belman



Re: Performance of parallel computing.

2018-12-07 Thread yary
OK... going back to the hypothesis in the OP

> The plateau is seemingly defined by the number of cores or, more correctly, 
> by the number of supported threads.

This suggests that the benchmark is CPU-bound, which is supported by
your more recent observation "100% load for a single one".

Also, you mentioned running MacOS with two threads per core, which
implies Intel's hyperthreading. Depending on the workload, CPU-bound
processes sharing a hyperthreaded core see a speedup of 0-30%, as
opposed to running on separate cores which can give a speedup of 100%.
(Back when I searched for large primes, HT gave a 25% speed boost.) So
with 6 cores, 2 HT per core, I would expect a max parallel boost of 6
* (1x +0.30x) = 7.8x - and your test is only giving half that.
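
For reference, that estimate as a quick calculation (the 30% figure being 
the upper end of the hyperthreading range above):

    my $cores   = 6;
    my $ht-gain = 0.30;              # assumed per-core hyperthreading gain
    say $cores * (1 + $ht-gain);     # 7.8 -- theoretical max speedup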

-y


Re: Performance of parallel computing.

2018-12-07 Thread Parrot Raiser
Is it possible that OS caching is having an effect on the performance?
It's sometimes necessary to run the same code several times before it
settles down to consistent results.
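
A minimal way to check that (a sketch; `the-benchmark` is a hypothetical 
stand-in for the actual workload): throw away a couple of warm-up runs first 
and see whether the timed runs still drift.

    sub the-benchmark { [+] 1..100_000 }   # stand-in for the real workload

    the-benchmark() for ^2;                # warm-up runs, results discarded

    for 1..5 {
        my $t0 = now;
        the-benchmark();
        say now - $t0;                     # should settle if caching is the cause
    }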

On 12/7/18, Vadim Belman  wrote:
> There is no need to fill in the channel prior to starting the workers.
> First of all, 100 repetitions of SHA256 per worker take ~0.7 sec on my
> system. I didn't benchmark the generator thread, but considering that even
> your timing gives 0.054 sec per string, it will most definitely remain fast
> enough to provide all workers with data. But even with this in mind I re-ran
> the test with only 100-character strings being generated. Here is what I've
> got:
>
> Benchmark:
> Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
>   workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
>   (warning: too few iterations for a reliable count)
>  workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
>   (warning: too few iterations for a reliable count)
>  workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
>   (warning: too few iterations for a reliable count)
>   workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
>   (warning: too few iterations for a reliable count)
>   workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
>   (warning: too few iterations for a reliable count)
>   workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
>   (warning: too few iterations for a reliable count)
> O-----------O-----------O----------O-----------O----------O-----------O----------O----------O
> |           | s/iter    | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
> O===========O===========O==========O===========O==========O===========O==========O==========O
> | workers3  | 10.553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
> | workers15 | 6.165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
> | workers5  | 7.650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
> | workers10 | 6.154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
> | workers2  | 14.101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
> | workers1  | 22.473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
> 
>
> What's more important is the observation of the CPU consumption by the moar
> process. Depending on the number of workers I was getting anything from 100%
> load for a single worker up to 1000% for the whole bunch of 15. This
> perfectly corresponds with the 6 cores/2 threads per core of my CPU.
>
>> On Dec 7, 2018, at 02:06, yary  wrote:
>>
>> That was a bit vague - I meant that I suspect the workers are being
>> starved, since you have many consumers and only a single thread
>> generating the 1k strings. I would prime the channel to be full, or
>> otherwise restructure to ensure all threads are kept busy.
>>
>> -y
>>
>> On Thu, Dec 6, 2018 at 10:56 PM yary  wrote:
>>>
>>> Not sure if your test is measuring what you expect - the setup of
>>> generating 50 x 1k strings is taking 2.7 sec on my laptop, and that's
>>> reducing the apparent effect of parallelism.
>>>
>>> $ perl6
>>> To exit type 'exit' or '^D'
>>>> my $c = Channel.new;
>>> Channel.new
>>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say
>>>> now - ENTER now; }
>>> 2.7289092
>>>
>>> I'd move the setup outside the "cmpthese", try again, and re-think the
>>> new results.
>>>
>>>
>>>
>>> On 12/6/18, Vadim Belman  wrote:
>>>> Hi everybody!
>>>> 
>>>> I have recently played a bit with somewhat intense computations and
>>>> tried to parallelize them among a couple of threaded workers. The
>>>> results were somewhat... eh... discouraging. To sum up my findings I
>>>> wrote a simple demo benchmark:
>>>> 
>>>> use Digest::SHA;
>>>> use Bench;
>>>> 
>>>> sub worker ( Str:D $str ) {
>>>>     my $digest = $str;
>>>> 
>>>>     for 1..100 {
>>>>         $digest = sha256 $digest;
>>>>     }
>>>> }
>>>> 
>>>> sub run ( Int $workers ) {
>>>>     my $c = Channel.new;
>>>> 
>>>>     my @w;
>>>>     @w.push: start {
>>>>         for 1..50 {
>>>>             $c.send(
>>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>>             );
>>>>         }
>>>>         LEAVE $c.close;
>>>>     }
>>>> 
>>>>     for 1..$workers {
>>>>         @w.push: start {
>>>>             react {
>>>>                 whenever $c -> $str {
>>>>                     worker( $str );
>>>>                 }
>>>>             }

Re: Performance of parallel computing.

2018-12-07 Thread Vadim Belman
There is no need to fill in the channel prior to starting the workers. First 
of all, 100 repetitions of SHA256 per worker take ~0.7 sec on my system. I 
didn't benchmark the generator thread, but considering that even your timing 
gives 0.054 sec per string, it will most definitely remain fast enough to 
provide all workers with data. But even with this in mind I re-ran the test 
with only 100-character strings being generated. Here is what I've got:

Benchmark:
Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
  workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
(warning: too few iterations for a reliable count)
 workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
(warning: too few iterations for a reliable count)
 workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
(warning: too few iterations for a reliable count)
  workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
(warning: too few iterations for a reliable count)
  workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
(warning: too few iterations for a reliable count)
  workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
(warning: too few iterations for a reliable count)
O-----------O-----------O----------O-----------O----------O-----------O----------O----------O
|           | s/iter    | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
O===========O===========O==========O===========O==========O===========O==========O==========O
| workers3  | 10.553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
| workers15 | 6.165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
| workers5  | 7.650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
| workers10 | 6.154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
| workers2  | 14.101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
| workers1  | 22.473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
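
As a back-of-the-envelope check of the "generator keeps up" claim above 
(both numbers taken from this thread):

    my $hash-time = 0.7;        # ~sec one worker spends hashing a string
    my $gen-time  = 0.054;      # ~sec the generator needs per string
    say $hash-time / $gen-time; # ~12.96: one generator feeds ~13 workers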


What's more important is the observation of the CPU consumption by the moar 
process. Depending on the number of workers I was getting anything from 100% 
load for a single worker up to 1000% for the whole bunch of 15. This perfectly 
corresponds with the 6 cores/2 threads per core of my CPU.
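
For the record, the logical CPU count the VM sees can be checked directly 
with Rakudo's $*KERNEL (a one-liner sketch):

    say $*KERNEL.cpu-cores;     # expected: 12 here (6 cores x 2 threads)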

> On Dec 7, 2018, at 02:06, yary  wrote:
> 
> That was a bit vague - I meant that I suspect the workers are being
> starved, since you have many consumers and only a single thread
> generating the 1k strings. I would prime the channel to be full, or
> otherwise restructure to ensure all threads are kept busy.
> 
> -y
> 
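
A sketch of what that priming could look like (an assumed shape of the 
suggestion, not code from the thread; it reuses worker() from the benchmark 
quoted below): fill and close the channel before any worker starts, so no 
consumer ever waits on the generator.

    my $workers = 5;
    my $c = Channel.new;
    $c.send( (1..1024).map( { (' '..'Z').pick } ).join ) for 1..50;
    $c.close;                            # all data queued before workers start

    my @w = (1..$workers).map: {
        start {
            react {
                whenever $c -> $str {
                    worker( $str );
                }
            }
        }
    }
    await @w;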
> On Thu, Dec 6, 2018 at 10:56 PM yary  wrote:
>> 
>> Not sure if your test is measuring what you expect - the setup of
>> generating 50 x 1k strings is taking 2.7 sec on my laptop, and that's
>> reducing the apparent effect of parallelism.
>> 
>> $ perl6
>> To exit type 'exit' or '^D'
>>> my $c = Channel.new;
>> Channel.new
>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now 
>>> - ENTER now; }
>> 2.7289092
>> 
>> I'd move the setup outside the "cmpthese", try again, and re-think the
>> new results.
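
A sketch of that restructuring (assumed, not code from the thread; it reuses 
worker() from the original post, and `run-on` is a hypothetical variant of 
run()): build the strings once, outside the timed closures, so cmpthese only 
measures the hashing.

    use Bench;

    # Data built once, before Bench times anything.
    my @strings = (1..50).map: { (1..1024).map( { (' '..'Z').pick } ).join };

    # Hypothetical variant of run() that consumes pre-built strings.
    sub run-on ( @strs, Int $workers ) {
        my $c = Channel.new;
        $c.send($_) for @strs;           # cheap: strings already exist
        $c.close;
        await (1..$workers).map: {
            start { react { whenever $c -> $str { worker( $str ) } } }
        }
    }

    my $b = Bench.new;
    $b.cmpthese( 1, {
        workers1 => sub { run-on( @strings, 1 ) },
        workers5 => sub { run-on( @strings, 5 ) },
    } );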
>> 
>> 
>> 
>> On 12/6/18, Vadim Belman  wrote:
>>> Hi everybody!
>>> 
>>> I have recently played a bit with somewhat intense computations and tried to
>>> parallelize them among a couple of threaded workers. The results were
>>> somewhat... eh... discouraging. To sum up my findings I wrote a simple demo
>>> benchmark:
>>> 
>>> use Digest::SHA;
>>> use Bench;
>>> 
>>> sub worker ( Str:D $str ) {
>>>     my $digest = $str;
>>> 
>>>     for 1..100 {
>>>         $digest = sha256 $digest;
>>>     }
>>> }
>>> 
>>> sub run ( Int $workers ) {
>>>     my $c = Channel.new;
>>> 
>>>     my @w;
>>>     @w.push: start {
>>>         for 1..50 {
>>>             $c.send(
>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>             );
>>>         }
>>>         LEAVE $c.close;
>>>     }
>>> 
>>>     for 1..$workers {
>>>         @w.push: start {
>>>             react {
>>>                 whenever $c -> $str {
>>>                     worker( $str );
>>>                 }
>>>             }
>>>         }
>>>     }
>>> 
>>>     await @w;
>>> }
>>> 
>>> my $b = Bench.new;
>>> $b.cmpthese(
>>>     1,
>>>     {
>>>         workers1 => sub { run( 1 ) },
>>>         workers5 => sub { run( 5 ) },
>>>         workers10 => sub { run( 10 ) },
>>>         workers15 => sub { run( 15 ) },
>>>     },
>>> );