Re: [hpx-users] Performance Counter Data Interpretation

2018-01-09 Thread Thomas Heller
Hi Kilian,

The cumulative overhead counter indeed looks very odd. What it means is
the accumulated time *not* spent doing useful work.
Let's break it apart (for the single thread case):
 - The overall runtime of your application appears to be 2 seconds.
 - The total number of tasks run is 127704, with an average task length
   of about 15 microseconds.
 - The cumulative-overhead and the idle rate are closely related: about
   5.14% of the total runtime is spent in the scheduler. This is a good
   starting point.

Now, when looking at the two core case, the overall number of tasks
doesn't change significantly; however, the average task runtime increases
noticeably. In addition, you observe that the idle rate is now clocking
in at 44%. This is consistent with the ratio of the accumulated time
spent in the scheduler (the overhead) to the accumulated time spent
doing useful work. The 8 core run then continues this trend.
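
To make the relation explicit: the idle rate HPX reports is (roughly) the
cumulative overhead divided by the sum of cumulative work time and
cumulative overhead. Plugging in the numbers from your own output:

  1 thread:  1.03483e+08 / (1.90769e+09 + 1.03483e+08) ~= 5.1%   (reported: 514 [0.01%])
  2 threads: 1.60469e+09 / (2.01907e+09 + 1.60469e+09) ~= 44.3%  (reported: 4428 [0.01%])
  8 threads: 3.25232e+10 / (4.1588e+09 + 3.25232e+10) ~= 88.7%   (reported: 8867 [0.01%])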

This could mean a variety of different things; the obvious one is this:
the average task duration is too short. That is, your granularity
appears to be too fine. IIRC, we recommend an average task length of
about 100 microseconds (an order of magnitude more than your results).

Essentially, what you see is an increase in overhead caused by the task
queues in the scheduler fighting over work.

To quickly test this hypothesis, you could artificially increase the
runtime of your tasks and see whether that improves scalability. If it
does, we have a winner and you need to implement some kind of granularity
control such that your tasks perform about 10 times more work, which
makes it much easier to amortize the overheads.
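
Something along these lines would do for a quick check (an untested
sketch; process_item() is just a stand-in for whatever one of your real
tasks does, and I'm only using async/wait_all here so that the test knows
when all of the inflated tasks have finished):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <cstddef>
#include <vector>

// stand-in for the real per-task work of your application
void process_item(std::size_t) { /* ... */ }

// run the same work several times to artificially lengthen the task
void inflated_task(std::size_t i, std::size_t factor)
{
    for (std::size_t k = 0; k != factor; ++k)
        process_item(i);
}

int main()
{
    std::size_t const num_items = 100000;

    std::vector<hpx::future<void>> tasks;
    tasks.reserve(num_items);
    for (std::size_t i = 0; i != num_items; ++i)
        tasks.push_back(hpx::async(&inflated_task, i, std::size_t(10)));

    hpx::wait_all(tasks);   // wait until all inflated tasks are done
    return 0;
}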

Hope that helps,
Thomas

On 01/09/2018 03:55 PM, Kilian Werner wrote:
> Dear Prof. Kaiser,
> 
> it is a new application we are building, with hpx used for
> parallelization (and a later distribution in mind).
> We have been working on it for the last two months, and the
> speedups have been this bad right from the start.
> However, since the obvious, inherent bottlenecks were dealt
> with one after the other, we went to performance counters
> for more detailed profiling.
> 
> Thanks,
> Kilian Werner
> 
> On Tue, 9 Jan 2018 08:47:12 -0600
>   "Hartmut Kaiser"  wrote:
>> Kilian,
>>
>> Was this slowdown always happening, or did it just
>> start to be bad
>> recently?
>>
>> Thanks!
>> Regards Hartmut
>> ---
>> http://boost-spirit.com
>> http://stellar.cct.lsu.edu
>>
>>
>>> -Original Message-
>>> From: hpx-users-boun...@stellar.cct.lsu.edu 
>>> [mailto:hpx-users-
>>> boun...@stellar.cct.lsu.edu] On Behalf Of Kilian Werner
>>> Sent: Tuesday, January 9, 2018 7:46 AM
>>> To: hpx-users@stellar.cct.lsu.edu
>>> Subject: [hpx-users] Performance Counter Data 
>>> Interpretation
>>>
>>> Dear hpx user list,
>>>
>>> one of our projects shows unexpectedly bad speedups when
>>> supplying additional OS-worker-threads to HPX.
>>> The project is run locally and in parallel on a machine
>>> with 8 cores, trying to pin down the parallelization
>>> bottleneck we printed the built in HPX Performance
>>> Counters as seen below.
>>> The parallelization is achieved by scheduling tasks with
>>> hpx::apply that themselves will schedule additional 
>>> tasks
>>> with hpx::apply.
>>> The program terminates after a final task (that can
>>> identify itself and will always finish last, independent
>>> of task scheduling order) fires an event.
>>> Synchronization is performed with some
>>> hpx::lcos::local::mutex locks.
>>>
>>> The problem seems to be apparent when looking at the
>>> harshly growing cumulative-overhead per worker-thread 
>>> when
>>> employing more OS threads.
>>> However we are a bit clueless as to interpret the 
>>> meaning
>>> of this cumulative-overhead counter.
>>> We were especially surprised to find, that the
>>> per-worker-thread overhead at some point came close to 
>>> and
>>> even surpassed the total cumulative runtime (see
>>> cumulative overhead of worker thread 0 when run with 8 
>>> os
>>> threads  vs. total cumulative runtime).
>>>
>>> What exactly does the performance counter
>>> /threads/time/cumulative-overhead measure? How can the
>>> overhead be larger than the total execution time?
>>> How could we narrow down the causes for the growing
>>> overhead? For example how could we measure how much time
>>> is spend waiting at (specific) mutexes  in total?
>>>
>>> Thanks in advance,
>>>
>>> Kilian Werner
>>>
>>>
>>>
>>> --hpx:threads 1:
>>>
>>> /threads{locality#0/total/total}/count/cumulative,1,2.015067,[s],127704
>>> /threads{locality#0/total/total}/time/average,1,2.015073,[s],14938,[ns]
>>> /threads{locality#0/total/total}/time/cumulative,1,2.015074,[s],1.90769e+0
>>> 9,[ns]
>>> /threads{locality#0/total/total}/time/cumulative-
>>> overhead,1,2.015076,[s],1.03483e+08,[ns]
>>> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
>>> overhead,1,2.015076,[s],1.03483e+08,[ns]
>>> /threads{locality#0/total/total}/idle-rate,1,2.015078,[s],514,[0.01%]

Re: [hpx-users] Performance Counter Data Interpretation

2018-01-09 Thread Hartmut Kaiser
Kilian,

> it is a new application we are building, with hpx used for
> parallelization (and a later distribution in mind).
> We have been working on it for the last two months, and the
> speedups have been this bad right from the start.
> However, since the obvious, inherent bottlenecks were dealt
> with one after the other, we went to performance counters
> for more detailed profiling.

Ok, I just wanted to make sure we didn't recently break something.

In general, high overheads point towards either too little parallelism or
too fine-grained tasks. From the performance counter data collected when
running on one core you can discern that it's at least the latter:

> /threads{locality#0/total/total}/time/average,1,2.015073,[s],14938,[ns]

This tells you that the average execution time of the HPX threads is in the
range of 15 microseconds. Taking into account that the overhead introduced
by the creation, scheduling, running, and destruction of a single HPX thread
is in the range of one microsecond, this average time hints that your
parallelism is too fine-grained. I'd suggest trying to somehow combine
several tasks into one to be run sequentially by one HPX thread. Average
execution times of 150-200 microseconds are a good target for HPX.
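
For instance (just a sketch, with a hypothetical handle_item() standing in
for your per-item work and a chunk size you would have to tune), you would
schedule one HPX thread per chunk of items rather than one per item:

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <algorithm>
#include <cstddef>
#include <vector>

// hypothetical per-item work of your application
void handle_item(std::size_t) { /* ... */ }

int main()
{
    std::size_t const num_items  = 100000;
    std::size_t const chunk_size = 64;   // tune until tasks average 150-200us

    std::vector<hpx::future<void>> chunks;
    for (std::size_t begin = 0; begin < num_items; begin += chunk_size)
    {
        std::size_t const end = std::min(begin + chunk_size, num_items);
        // one HPX thread runs a whole chunk of items sequentially
        chunks.push_back(hpx::async([begin, end]() {
            for (std::size_t i = begin; i != end; ++i)
                handle_item(i);
        }));
    }
    hpx::wait_all(chunks);
    return 0;
}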

You can also see that your idle-rate increases significantly once you
increase the number of cores to run on. This might indicate that your
application does not expose enough parallelism to warrant running it on as
many cores as you are trying to use. Once you run on 2 cores your idle-rate
goes up to 45%, which is quite a large number (idle-rates of up to 15% are
usually still ok).

From what I can see, your application launches a lot of very small tasks
which depend sequentially on each other. Alternatively, you have a lot of
cross-thread synchronization going on, which prevents efficient
parallelization. But that is pure conjecture and I might be completely
wrong.

The fact that you're mostly using 'apply' to schedule your threads suggests
that you need to do your own synchronization, which could be causing the
problems. Changing this to use futures as the means of synchronization
(i.e. using async instead of apply) and building asynchronous execution
trees out of those futures may reduce contention and may remove the need to
explicitly synchronize between threads.
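
As a rough sketch of what I mean (produce_part() is a hypothetical
stand-in, not code from your application, and your real dependency
structure will of course differ):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>

#include <cstddef>
#include <iostream>
#include <vector>

// hypothetical stand-in for one of your small tasks, now returning its result
int produce_part(std::size_t i) { return static_cast<int>(i); }

int main()
{
    std::vector<hpx::future<int>> parts;
    for (std::size_t i = 0; i != 1000; ++i)
        parts.push_back(hpx::async(&produce_part, i));

    // The continuation runs once all parts are ready; it replaces both the
    // mutex-protected shared state and the 'final task fires an event' logic.
    hpx::future<int> result = hpx::when_all(parts).then(
        [](hpx::future<std::vector<hpx::future<int>>> all) {
            int sum = 0;
            for (hpx::future<int>& f : all.get())
                sum += f.get();    // every future is ready, no locking needed
            return sum;
        });

    std::cout << result.get() << std::endl;
    return 0;
}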

In general, I'd suggest using a tool like APEX (non-Windows only, preferred
for Linux) or Intel Amplifier (preferred for Windows, but works on Linux as
well) to visualize your real task dependencies at runtime. Both tools
require special configuration, but both of them are directly supported by
HPX. Please let us know if you need more information.
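
If I remember the CMake option names correctly (please double-check the
documentation), both are enabled when configuring HPX itself, along the
lines of:

  cmake -DHPX_WITH_APEX=ON ...         # APEX support
  cmake -DHPX_WITH_ITTNOTIFY=ON ...    # Intel Amplifier support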

HTH
Regards Hartmut
---
http://boost-spirit.com
http://stellar.cct.lsu.edu


> 
> Thanks,
> Kilian Werner
> 
> On Tue, 9 Jan 2018 08:47:12 -0600
>   "Hartmut Kaiser"  wrote:
> > Kilian,
> >
> > Was this slowdown always happening, or did it just
> > start to be bad
> > recently?
> >
> > Thanks!
> > Regards Hartmut
> > ---
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> >
> >> -Original Message-
> >> From: hpx-users-boun...@stellar.cct.lsu.edu
> >>[mailto:hpx-users-
> >> boun...@stellar.cct.lsu.edu] On Behalf Of Kilian Werner
> >> Sent: Tuesday, January 9, 2018 7:46 AM
> >> To: hpx-users@stellar.cct.lsu.edu
> >> Subject: [hpx-users] Performance Counter Data
> >>Interpretation
> >>
> >> Dear hpx user list,
> >>
> >> one of our projects shows unexpectedly bad speedups when
> >> supplying additional OS-worker-threads to HPX.
> >> The project is run locally and in parallel on a machine
> >> with 8 cores, trying to pin down the parallelization
> >> bottleneck we printed the built in HPX Performance
> >> Counters as seen below.
> >> The parallelization is achieved by scheduling tasks with
> >> hpx::apply that themselves will schedule additional
> >>tasks
> >> with hpx::apply.
> >> The program terminates after a final task (that can
> >> identify itself and will always finish last, independent
> >> of task scheduling order) fires an event.
> >> Synchronization is performed with some
> >> hpx::lcos::local::mutex locks.
> >>
> >> The problem seems to be apparent when looking at the
> >> harshly growing cumulative-overhead per worker-thread
> >>when
> >> employing more OS threads.
> >> However we are a bit clueless as to interpret the
> >>meaning
> >> of this cumulative-overhead counter.
> >> We were especially surprised to find, that the
> >> per-worker-thread overhead at some point came close to
> >>and
> >> even surpassed the total cumulative runtime (see
> >> cumulative overhead of worker thread 0 when run with 8
> >>os
> >> threads  vs. total cumulative runtime).
> >>
> >> What exactly does the performance counter
> >> /threads/time/cumulative-overhead measure? How can the
> >> overhead be larger than the total execution time?
> >> How could we narrow down the causes for the growing
> >> overhead? For 

Re: [hpx-users] Performance Counter Data Interpretation

2018-01-09 Thread Kilian Werner
Dear Prof. Kaiser,

it is a new application we are building, with hpx used for 
parallelization (and a later distribution in mind).
We have been working on it for the last two months, and the 
speedups have been this bad right from the start.
However, since the obvious, inherent bottlenecks were dealt 
with one after the other, we went to performance counters 
for more detailed profiling.

Thanks,
Kilian Werner

On Tue, 9 Jan 2018 08:47:12 -0600
  "Hartmut Kaiser"  wrote:
> Kilian,
> 
> Was this slowdown always happening, or did it just 
> start to be bad
> recently?
> 
> Thanks!
> Regards Hartmut
> ---
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
> 
> 
>> -Original Message-
>> From: hpx-users-boun...@stellar.cct.lsu.edu 
>>[mailto:hpx-users-
>> boun...@stellar.cct.lsu.edu] On Behalf Of Kilian Werner
>> Sent: Tuesday, January 9, 2018 7:46 AM
>> To: hpx-users@stellar.cct.lsu.edu
>> Subject: [hpx-users] Performance Counter Data 
>>Interpretation
>> 
>> Dear hpx user list,
>> 
>> one of our projects shows unexpectedly bad speedups when
>> supplying additional OS-worker-threads to HPX.
>> The project is run locally and in parallel on a machine
>> with 8 cores, trying to pin down the parallelization
>> bottleneck we printed the built in HPX Performance
>> Counters as seen below.
>> The parallelization is achieved by scheduling tasks with
>> hpx::apply that themselves will schedule additional 
>>tasks
>> with hpx::apply.
>> The program terminates after a final task (that can
>> identify itself and will always finish last, independent
>> of task scheduling order) fires an event.
>> Synchronization is performed with some
>> hpx::lcos::local::mutex locks.
>> 
>> The problem seems to be apparent when looking at the
>> harshly growing cumulative-overhead per worker-thread 
>>when
>> employing more OS threads.
>> However we are a bit clueless as to interpret the 
>>meaning
>> of this cumulative-overhead counter.
>> We were especially surprised to find, that the
>> per-worker-thread overhead at some point came close to 
>>and
>> even surpassed the total cumulative runtime (see
>> cumulative overhead of worker thread 0 when run with 8 
>>os
>> threads  vs. total cumulative runtime).
>> 
>> What exactly does the performance counter
>> /threads/time/cumulative-overhead measure? How can the
>> overhead be larger than the total execution time?
>> How could we narrow down the causes for the growing
>> overhead? For example how could we measure how much time
>> is spend waiting at (specific) mutexes  in total?
>> 
>> Thanks in advance,
>> 
>> Kilian Werner
>> 
>> 
>> 
>> --hpx:threads 1:
>> 
>> /threads{locality#0/total/total}/count/cumulative,1,2.015067,[s],127704
>> /threads{locality#0/total/total}/time/average,1,2.015073,[s],14938,[ns]
>> /threads{locality#0/total/total}/time/cumulative,1,2.015074,[s],1.90769e+0
>> 9,[ns]
>> /threads{locality#0/total/total}/time/cumulative-
>> overhead,1,2.015076,[s],1.03483e+08,[ns]
>> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
>> overhead,1,2.015076,[s],1.03483e+08,[ns]
>> /threads{locality#0/total/total}/idle-rate,1,2.015078,[s],514,[0.01%]
>> 
>> --hpx:threads 2:
>> 
>> /threads{locality#0/total/total}/count/cumulative,1,1.814639,[s],112250
>> /threads{locality#0/total/total}/time/average,1,1.814644,[s],17986,[ns]
>> /threads{locality#0/total/total}/time/cumulative,1,1.814654,[s],2.01907e+0
>> 9,[ns]
>> /threads{locality#0/total/total}/time/cumulative-
>> overhead,1,1.814647,[s],1.60469e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
>> overhead,1,1.814599,[s],1.12562e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#1}/time/cumulative-
>> overhead,1,1.814649,[s],4.79071e+08,[ns]
>> /threads{locality#0/total/total}/idle-rate,1,1.814603,[s],4428,[0.01%]
>> 
>> --hpx:threads 8:
>> 
>> /threads{locality#0/total/total}/count/cumulative,1,4.597361,[s],109476
>> /threads{locality#0/total/total}/time/average,1,4.597373,[s],37988,[ns]
>> /threads{locality#0/total/total}/time/cumulative,1,4.597335,[s],4.1588e+09
>> ,[ns]
>> /threads{locality#0/total/total}/time/cumulative-
>> overhead,1,4.597325,[s],3.25232e+10,[ns]
>> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
>> overhead,1,4.597408,[s],4.20735e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#1}/time/cumulative-
>> overhead,1,4.597390,[s],4.08787e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#2}/time/cumulative-
>> overhead,1,4.597385,[s],3.62298e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#3}/time/cumulative-
>> overhead,1,4.597358,[s],4.12475e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#4}/time/cumulative-
>> overhead,1,4.597338,[s],4.10011e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#5}/time/cumulative-
>> overhead,1,4.597402,[s],4.14242e+09,[ns]
>> /threads{locality#0/pool#default/worker-thread#6}/time/cumulative-
>> 

Re: [hpx-users] Performance Counter Data Interpretation

2018-01-09 Thread Hartmut Kaiser
Kilian,

Was this slowdown always happening, or did it just start to be bad
recently?

Thanks!
Regards Hartmut
---
http://boost-spirit.com
http://stellar.cct.lsu.edu


> -Original Message-
> From: hpx-users-boun...@stellar.cct.lsu.edu [mailto:hpx-users-
> boun...@stellar.cct.lsu.edu] On Behalf Of Kilian Werner
> Sent: Tuesday, January 9, 2018 7:46 AM
> To: hpx-users@stellar.cct.lsu.edu
> Subject: [hpx-users] Performance Counter Data Interpretation
> 
> Dear hpx user list,
> 
> one of our projects shows unexpectedly bad speedups when
> supplying additional OS-worker-threads to HPX.
> The project is run locally and in parallel on a machine
> with 8 cores; trying to pin down the parallelization
> bottleneck, we printed the built-in HPX Performance
> Counters as seen below.
> The parallelization is achieved by scheduling tasks with
> hpx::apply that themselves will schedule additional tasks
> with hpx::apply.
> The program terminates after a final task (that can
> identify itself and will always finish last, independent
> of task scheduling order) fires an event.
> Synchronization is performed with some
> hpx::lcos::local::mutex locks.
> 
> The problem seems to be apparent when looking at the
> sharply growing cumulative-overhead per worker-thread when
> employing more OS threads.
> However, we are a bit clueless as to how to interpret the
> meaning of this cumulative-overhead counter.
> We were especially surprised to find that the
> per-worker-thread overhead at some point came close to and
> even surpassed the total cumulative runtime (see the
> cumulative overhead of worker thread 0 when run with 8 OS
> threads vs. the total cumulative runtime).
> 
> What exactly does the performance counter
> /threads/time/cumulative-overhead measure? How can the
> overhead be larger than the total execution time?
> How could we narrow down the causes of the growing
> overhead? For example, how could we measure how much time
> is spent waiting at (specific) mutexes in total?
> 
> Thanks in advance,
> 
> Kilian Werner
> 
> 
> 
> --hpx:threads 1:
> 
> /threads{locality#0/total/total}/count/cumulative,1,2.015067,[s],127704
> /threads{locality#0/total/total}/time/average,1,2.015073,[s],14938,[ns]
> /threads{locality#0/total/total}/time/cumulative,1,2.015074,[s],1.90769e+0
> 9,[ns]
> /threads{locality#0/total/total}/time/cumulative-
> overhead,1,2.015076,[s],1.03483e+08,[ns]
> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
> overhead,1,2.015076,[s],1.03483e+08,[ns]
> /threads{locality#0/total/total}/idle-rate,1,2.015078,[s],514,[0.01%]
> 
> --hpx:threads 2:
> 
> /threads{locality#0/total/total}/count/cumulative,1,1.814639,[s],112250
> /threads{locality#0/total/total}/time/average,1,1.814644,[s],17986,[ns]
> /threads{locality#0/total/total}/time/cumulative,1,1.814654,[s],2.01907e+0
> 9,[ns]
> /threads{locality#0/total/total}/time/cumulative-
> overhead,1,1.814647,[s],1.60469e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
> overhead,1,1.814599,[s],1.12562e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#1}/time/cumulative-
> overhead,1,1.814649,[s],4.79071e+08,[ns]
> /threads{locality#0/total/total}/idle-rate,1,1.814603,[s],4428,[0.01%]
> 
> --hpx:threads 8:
> 
> /threads{locality#0/total/total}/count/cumulative,1,4.597361,[s],109476
> /threads{locality#0/total/total}/time/average,1,4.597373,[s],37988,[ns]
> /threads{locality#0/total/total}/time/cumulative,1,4.597335,[s],4.1588e+09
> ,[ns]
> /threads{locality#0/total/total}/time/cumulative-
> overhead,1,4.597325,[s],3.25232e+10,[ns]
> /threads{locality#0/pool#default/worker-thread#0}/time/cumulative-
> overhead,1,4.597408,[s],4.20735e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#1}/time/cumulative-
> overhead,1,4.597390,[s],4.08787e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#2}/time/cumulative-
> overhead,1,4.597385,[s],3.62298e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#3}/time/cumulative-
> overhead,1,4.597358,[s],4.12475e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#4}/time/cumulative-
> overhead,1,4.597338,[s],4.10011e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#5}/time/cumulative-
> overhead,1,4.597402,[s],4.14242e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#6}/time/cumulative-
> overhead,1,4.597353,[s],4.13593e+09,[ns]
> /threads{locality#0/pool#default/worker-thread#7}/time/cumulative-
> overhead,1,4.597408,[s],4.13275e+09,[ns]
> /threads{locality#0/total/total}/idle-rate,1,4.597350,[s],8867,[0.01%]
> 
> 
> ___
> hpx-users mailing list
> hpx-users@stellar.cct.lsu.edu
> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users

___
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users


[hpx-users] HPX 1.1.0 preparation

2018-01-09 Thread Simberg Mikael
Dear HPX users and developers,

We are preparing to release HPX 1.1.0 and are aiming for a release candidate by 
January 31st and a release two weeks later on February 14th. Changes still go 
into master as usual until the release candidate.

We still have quite a few open issues for the release: 
https://github.com/STEllAR-GROUP/hpx/issues?page=1&q=is%3Aopen+is%3Aissue+milestone%3A1.1.0
Please have a look and comment if you can work on any of them, if you think 
they are no longer relevant, or if they can be moved to future releases.

You can help us test HPX by running your applications with the latest master 
branch. If you encounter any issues, report them here: 
https://github.com/STEllAR-GROUP/hpx/issues.

If there are examples or benchmarks that you'd like to see added to or removed 
from HPX for the release, comment here: 
https://github.com/STEllAR-GROUP/hpx/issues/3049.

Finally, if you have been on the fence about contributing to HPX, I'd like to 
remind you that every little helps. Writing documentation is just as important 
as implementing new features!

Kind regards,
Mikael Simberg

___
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users