On Mon, May 9, 2016 at 1:50 AM, Yi He <yi...@linaro.org> wrote:

> Hi, Bill
>
> Thanks very much for your detailed explanation. I understand the
> programming practice to be like:
>
> /* First, the developer gets a chance to specify core availability for
>  * the application instance.
>  */
> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )
>
> *So it is possible to run an application with a different core-availability
> spec on different platforms, and possible to run multiple application
> instances on one platform in isolation.*
>
> *A: Making the above command-line parameters can help keep the application
> binary portable; running it on platform A or B requires no recompilation,
> only the invocation parameters change.*
>

The intent behind the ability to specify cpumasks at odp_init_global() time
is to allow a launcher script that is configured by some provisioning agent
(e.g., OpenDaylight) to communicate core assignments down to the ODP
implementation in a platform-independent manner.  So applications will fall
into two categories: those that have provisioned coremasks that simply get
passed through, and more "stand-alone" applications that will use
odp_cpumask_all_available() and odp_cpumask_default_worker/control() as
noted earlier to size themselves dynamically to the available processing
resources.  In both cases there is no need to recompile the application;
it simply creates an appropriate number of control/worker threads as
determined either by external configuration or inquiry.
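
To make that concrete, here is a minimal sketch of both cases. This is an
illustration only: get_provisioned_masks() is a hypothetical helper standing
in for whatever the launcher passes down, and the exact odp_init_global() /
odp_init_local() signatures differ between ODP versions.

#include <string.h>
#include <odp_api.h>

/* Hypothetical helper: fill param->worker_cpus/control_cpus from whatever
 * the launcher or provisioning agent supplied; return 0 if masks were given.
 * Stubbed out here only to keep the sketch self-contained.
 */
static int get_provisioned_masks(int argc, char *argv[], odp_init_t *param)
{
	(void)argc; (void)argv; (void)param;
	return -1; /* nothing provisioned */
}

int main(int argc, char *argv[])
{
	odp_init_t param;
	odp_cpumask_t worker_mask;
	int num_workers;

	memset(&param, 0, sizeof(param));

	/* Category 1: provisioned coremasks are simply passed through.
	 * If nothing was provisioned the masks stay NULL and the
	 * implementation derives defaults from what is available.
	 */
	(void)get_provisioned_masks(argc, argv, &param);

	if (odp_init_global(&param, NULL))
		return -1;
	if (odp_init_local(ODP_THREAD_CONTROL))
		return -1;

	/* Category 2 (and also after pass-through): size the worker pool
	 * from the resulting default worker mask; 0 requests "as many as
	 * are available".
	 */
	num_workers = odp_cpumask_default_worker(&worker_mask, 0);

	/* ... create num_workers worker threads, run, then terminate ... */
	return 0;
}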


>
> /* The application developer fans out worker/control threads depending on
>  * its needs and the actual availability.
>  */
> actually_available_cores =
>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>
> iterator( actually_available_cores ) {
>
>     /* Fan out one worker thread instance */
>     odph_linux_pthread_create(...upon one available core...);
> }
>
> *B: Is odph_linux_pthread_create() a temporary helper API that will
> converge into a platform-independent odp_thread_create(..one core spec...) in
> the future? Or is it deliberately left as a platform-dependent helper API?*
>
> Based on the above understanding, and back to the ODP-427 problem, which
> seems to be just that the main thread (the program entry point) was
> accidentally not pinned to one core :). The main thread is also an
> ODP_THREAD_CONTROL, but was not instantiated through
> odph_linux_pthread_create().
>

ODP provides no APIs or helpers to control thread pinning. The only
controls ODP provides are the ability to know the number of available cores,
to partition them for use by worker and control threads, and the ability
(via helpers) to create a number of threads of the application's choosing.
The implementation is expected to schedule these threads to available cores
in a fair manner, so if the number of application threads is less than or
equal to the number of available cores then implementations SHOULD (but are
not required to) pin each thread to its own core. Applications SHOULD NOT
be designed to require or depend on any specific thread-to-core mapping,
both for portability and because what constitutes a "core" in a virtual
environment may or may not represent dedicated hardware.
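
As an illustration only (plain pthreads stand in here for the odph_linux_*
helpers, whose exact signatures have changed between releases), the
application's job is limited to choosing how many threads to create:

#include <pthread.h>
#include <odp_api.h>

#define MAX_WORKERS 64 /* arbitrary cap for this sketch */

/* Sketch: one worker per available worker CPU, with no explicit pinning.
 * Placement is left entirely to the ODP implementation and the OS
 * scheduler. The odp_init_local() signature varies by ODP version.
 */
static void *worker_fn(void *arg)
{
	(void)arg;
	odp_init_local(ODP_THREAD_WORKER);

	/* ... odp_schedule() dispatch loop ... */

	odp_term_local();
	return NULL;
}

static int launch_workers(void)
{
	odp_cpumask_t worker_mask;
	pthread_t tid[MAX_WORKERS];
	int i, num;

	num = odp_cpumask_default_worker(&worker_mask, 0);
	if (num > MAX_WORKERS)
		num = MAX_WORKERS;

	for (i = 0; i < num; i++)
		pthread_create(&tid[i], NULL, worker_fn, NULL);

	return num;
}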


>
> A solution can be: in the odp_init_global() API, after
> odp_cpumask_init_global(), pin the main thread to the first available
> control-thread core. This adds a new behavioural specification to this API,
> but seems natural. Actually Ivan's patch did most of this, except that the
> core was fixed to 0. We can discuss it in today's meeting.
>

An application may consist of more than a single thread at the time it
calls odp_init_global(), however it is RECOMMENDED that odp_init_global()
be called only from the application's initial thread, and before it creates
any other threads, to avoid the address space confusion that has been the
subject of the past couple of ARCH calls and that we are looking to reach
consensus on. I'd like to move that topic to a separate discussion thread
from this one, if you don't mind.


>
> Thanks and Best Regards, Yi
>
> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischo...@linaro.org> wrote:
>
>> These are all good questions. ODP divides threads into worker threads and
>> control threads. The distinction is that worker threads are supposed to be
>> performance sensitive and perform optimally with dedicated cores while
>> control threads perform more "housekeeping" functions and would be less
>> impacted by sharing cores.
>>
>> In the absence of explicit API calls, it is unspecified how an ODP
>> implementation assigns threads to cores. The distinction between worker and
>> control thread is a hint to the underlying implementation that should be
>> used in managing available processor resources.
>>
>> The APIs in cpumask.h enable applications to determine how many CPUs are
>> available to it and how to divide them among worker and control threads
>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note
>> that ODP does not provide APIs for setting specific threads to specific
>> CPUs, so keep that in mind in the answers below.
>>
>>
>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi...@linaro.org> wrote:
>>
>>> Hi, thanks Bill
>>>
>>> I understand the ODP thread concept more deeply now, and that in embedded
>>> apps developers are involved in target-platform tuning/optimization.
>>>
>>> Can I give a little example: say we have a data-plane app which includes
>>> 3 ODP threads, and we would like to install and run it on 2 platforms.
>>>
>>>    - Platform A: 2 cores.
>>>    - Platform B: 10 cores
>>>
>>> During initialization, the application can use
>> odp_cpumask_all_available() to determine how many CPUs are available and
>> can (optionally) use odp_cpumask_default_worker() and
>> odp_cpumask_default_control() to divide them into CPUs that should be used
>> for worker and control threads, respectively. For an application designed
>> for scale-out, the number of available CPUs would typically be used to
>> control how many worker threads the application creates. If the number of
>> worker threads matches the number of worker CPUs then the ODP
>> implementation would be expected to dedicate a worker core to each worker
>> thread. If more threads are created than there are corresponding cores,
>> then it is up to each implementation as to how it multiplexes them among
>> the available cores in a fair manner.
>>
>>
>>> Question: which of the assumptions below is the current ODP
>>> programming model?
>>>
>>> *1.* The application developer writes target-platform-specific code
>>> specifying that:
>>>
>>> On platform A, run thread (0) on core (0) and threads (1,2) on core (1).
>>> On platform B, run thread (0) on core (0), let thread (1) scale out to 8
>>> instances on cores (1~8), and run thread (2) on core (9).
>>>
>>
>> As noted, ODP does not provide APIs that permit specific threads to be
>> assigned to specific cores. Instead it is up to each ODP implementation as
>> to how it maps ODP threads to available CPUs, subject to the advisory
>> information provided by the ODP thread type and the cpumask assignments for
>> control and worker threads. So in these examples, suppose the application
>> has two control threads and one or more workers. For Platform A you might
>> have core 0 defined for control threads and core 1 for worker threads; in
>> this case threads 0 and 1 would run on core 0 while thread 2 runs on core 1.
>> For Platform B it's again up to the application how
>> it wants to divide the 10 CPUs between control and worker. It may want to
>> have 2 control CPUs so that each control thread can have its own core,
>> leaving 8 worker threads, or it might have the control threads share a
>> single CPU and have 9 worker threads with their own cores.
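
For instance, on the 10-core platform the choice between those two layouts
comes down to what the application requests. A sketch only: which physical
CPUs land in each mask, and how they are shared, remains up to the
implementation, and the expected counts in the comments follow the example
above rather than any guarantee.

#include <odp_api.h>

/* Sketch: dividing 10 available CPUs between control and worker threads.
 * The return values report how many CPUs were actually granted.
 */
static void divide_cpus(int wanted_control)
{
	odp_cpumask_t control_mask, worker_mask;
	int num_control, num_workers;

	num_control = odp_cpumask_default_control(&control_mask, wanted_control);
	num_workers = odp_cpumask_default_worker(&worker_mask, 0);

	/* Per the example above:
	 * wanted_control == 2: e.g. num_control == 2, num_workers == 8
	 * wanted_control == 1: e.g. num_control == 1, num_workers == 9
	 */
	(void)num_control;
	(void)num_workers;
}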
>>
>>
>>>
>>>
>>> Installing and running on a different platform requires the above
>>> platform-specific code and recompilation for the target.
>>>
>>
>> No. As noted, the model is the same. The only difference is how many
>> control/worker threads the application chooses to create based on the
>> information it gets during initialization from odp_cpumask_all_available().
>>
>>
>>>
>>> *2.* The application developer writes code to specify:
>>>
>>> Threads (0, 2) would not scale out
>>> Thread (1) can scale out (up to a limit N?)
>>> Platform A has 3 cores available (as a command-line parameter?)
>>> Platform B has 10 cores available (as a command-line parameter?)
>>>
>>> Installing and running on a different platform may not require
>>> re-compilation; ODP intelligently arranges the threads according to the
>>> information provided.
>>>
>>
>> Applications determine the minimum number of threads they require. Most
>> applications would tend to have a fixed number of control threads (based on
>> the application's functional design) and a variable number of worker
>> threads (minimum 1) based on available processing resources. These
>> application-defined minimums determine the minimum configuration the
>> application might need for optimal performance, with scale-out to larger
>> configurations performed automatically.
>>
>>
>>>
>>> Last question: in some cases, like power-save mode where the available
>>> cores shrink, would ODP intelligently re-arrange the ODP threads
>>> dynamically at runtime?
>>>
>>
>> The intent is that while control threads may have distinct roles and
>> responsibilities (thus requiring that all always be eligible to be
>> scheduled) worker threads are symmetric and interchangeable. So in this
>> case if I have N worker threads to match to the N available worker CPUs and
>> power save mode wants to reduce that number to N-1, then the only effect is
>> that the worker CPU entering power save mode goes dormant along with the
>> thread that is running on it. That thread isn't redistributed to some other
>> core because it's the same as the other worker threads. It is expected
>> that cores would only enter a power-save state at odp_schedule() boundaries.
>> So for example, if odp_schedule() determines that there is no work to
>> dispatch to this thread, that might trigger the associated CPU to enter
>> low-power mode. When that core later wakes up, odp_schedule() would continue
>> and then return work to its reactivated thread.
>>
>> A slight wrinkle here is the concept of scheduler groups, which allows
>> work classes to be dispatched to different groups of worker threads.  In
>> this case the implementation might want to take scheduler group membership
>> into consideration in determining which cores to idle for power savings.
>> However, the ODP API itself is silent on this subject as it is
>> implementation dependent how power save modes are managed.
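
To make the odp_schedule() boundary concrete, a worker's dispatch loop is
typically little more than the following (a sketch; the event handling
itself is application-specific):

#include <odp_api.h>

/* Sketch of a worker dispatch loop: any power management happens inside
 * odp_schedule() while the thread waits for work; nothing here pins or
 * migrates the thread explicitly.
 */
static void worker_loop(void)
{
	odp_queue_t from;
	odp_event_t ev;

	while (1) {
		/* Block until an event is available; an implementation may
		 * put this CPU into a low-power state while it waits here.
		 */
		ev = odp_schedule(&from, ODP_SCHED_WAIT);

		if (ev == ODP_EVENT_INVALID)
			continue;

		/* ... process the event (application-specific) ... */

		odp_event_free(ev);
	}
}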
>>
>>
>>>
>>> Thanks and Best Regards, Yi
>>>
>>
>> Thank you for these questions. In answering them I realized we do not
>> (yet) have this information covered in the ODP User Guide. I'll be using
>> this information to help fill in that gap.
>>
>>
>>>
>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischo...@linaro.org>
>>> wrote:
>>>
>>>> I've added this to the agenda for Monday's call, however I suggest we
>>>> continue the dialog here as well as background.
>>>>
>>>> Regarding thread pinning, there's always been a tradeoff there. On the
>>>> one hand, dedicating cores to threads is ideal for scale-out in many-core
>>>> systems; however, ODP does not require many-core environments to work
>>>> effectively, so ODP APIs enable but do not require or assume that cores
>>>> are dedicated to threads. That's really a question of application design
>>>> and fit to the particular platform it's running on. In embedded
>>>> environments you'll likely see this model more, since the application
>>>> knows which platform it's being targeted for. In VNF environments, by
>>>> contrast, you're more likely to see a blend where applications take
>>>> advantage of however many cores are available to them but still run
>>>> without dedicated cores in environments with more modest resources.
>>>>
>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi...@linaro.org> wrote:
>>>>
>>>>> Hi, thanks Mike and Bill,
>>>>>
>>>>> From your clear summary, can we put it into several TO-DO decisions?
>>>>> (We can have a discussion in the next ARCH call.)
>>>>>
>>>>>    1. How to address the precise semantics of the existing timing
>>>>>    APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>
>>>>>
>>>>>    - *A:* guarantee this by adding a constraint to the ODP thread
>>>>>    concept: every ODP thread shall be deployed and pinned to one CPU core.
>>>>>       - A sub-question: my understanding is that application
>>>>>       programmers only need to specify the available CPU sets for
>>>>>       control/worker threads, and it is up to ODP to arrange the threads
>>>>>       onto each CPU core while launching, right?
>>>>>    - *B:* guarantee this by adding new APIs to disable/enable CPU
>>>>>    migration.
>>>>>    - Then document this clearly in the user's guide or API documentation.
>>>>>
>>>>>
>>>>>    2. Understand the requirement to have both processor-local and
>>>>>    system-wide timing APIs:
>>>>>
>>>>>
>>>>>    - There are some APIs available in time.h (odp_time_local(), etc).
>>>>>    - We can have a thread to work through the relationship, usage
>>>>>    scenarios and constraints of the APIs in time.h and cpu.h.
>>>>>
>>>>> Best Regards, Yi
>>>>>
>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischo...@linaro.org>
>>>>> wrote:
>>>>>
>>>>>> I think there are two fallouts from this discussion. First, there is
>>>>>> the question of the precise semantics of the existing timing APIs as
>>>>>> they relate to processor locality. Applications such as profiling tests,
>>>>>> to the extent that they use APIs that have processor-local semantics,
>>>>>> must ensure that the thread(s) using these APIs are pinned for the
>>>>>> duration of the measurement.
>>>>>>
>>>>>> The other point is the one that Petri brought up about having other
>>>>>> APIs that provide timing information based on wall time or other metrics
>>>>>> that are not processor-local.  While these may not have the same
>>>>>> performance characteristics, they would be independent of thread 
>>>>>> migration
>>>>>> considerations.
>>>>>>
>>>>>> Of course all this depends on exactly what one is trying to measure.
>>>>>> Since thread migration is not free, allowing such activity may or may not
>>>>>> be relevant to what is being measured, so ODP probably wants to have both
>>>>>> processor-local and systemwide timing APIs.  We just need to be sure they
>>>>>> are specified precisely so that applications know how to use them 
>>>>>> properly.
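
A small sketch of the two flavors (the cycle counters in cpu.h are
processor-local and only meaningful if the thread is not migrated between
samples; odp_time_local() in time.h does not carry that assumption):

#include <inttypes.h>
#include <stdio.h>
#include <odp_api.h>

/* Sketch: measuring the same interval two ways. The cycle-count delta
 * assumes this thread stays on one CPU between the samples; the
 * odp_time_*() delta does not depend on which CPU the thread runs on.
 */
static void measure(void (*work)(void))
{
	uint64_t c1, c2;
	odp_time_t t1, t2;

	c1 = odp_cpu_cycles();
	t1 = odp_time_local();

	work();

	t2 = odp_time_local();
	c2 = odp_cpu_cycles();

	printf("cycles: %" PRIu64 ", ns: %" PRIu64 "\n",
	       odp_cpu_cycles_diff(c2, c1),
	       odp_time_to_ns(odp_time_diff(t2, t1)));
}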
>>>>>>
>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.hol...@linaro.org>
>>>>>> wrote:
>>>>>>
>>>>>>> It sounded like the arch call was leaning towards documenting that
>>>>>>> on odp-linux the application must ensure that odp_threads are pinned
>>>>>>> to cores when launched.
>>>>>>> This is a restriction that some platforms may not need to make, vs
>>>>>>> the idea that a piece of ODP code can use these APIs to ensure the 
>>>>>>> behavior
>>>>>>> it needs without knowledge or reliance on the wider system.
>>>>>>>
>>>>>>> On 4 May 2016 at 01:45, Yi He <yi...@linaro.org> wrote:
>>>>>>>
>>>>>>>> Establishing a performance profiling environment guarantees the
>>>>>>>> meaningfulness and consistency of consecutive invocations of the
>>>>>>>> odp_cpu_xxx() APIs.
>>>>>>>> After profiling is done, the execution environment is restored to
>>>>>>>> its multi-core optimized state.
>>>>>>>>
>>>>>>>> Signed-off-by: Yi He <yi...@linaro.org>
>>>>>>>> ---
>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h
>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>
>>>>>>>>
>>>>>>>>  /**
>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>> + * ODP performance profiler handle
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Setup a performance profiling environment
>>>>>>>> + *
>>>>>>>> + * A performance profiling environment guarantees meaningful and consistency of
>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.
>>>>>>>> + *
>>>>>>>> + * @return performance profiler handle
>>>>>>>> + */
>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>>   * CPU identifier
>>>>>>>>   *
>>>>>>>>   * Determine CPU identifier on which the calling is running. CPU numbering is
>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>
>>>>>>>>  /**
>>>>>>>> + * Stop the performance profiling environment
>>>>>>>> + *
>>>>>>>> + * Stop performance profiling and restore the execution environment to its
>>>>>>>> + * multi-core optimized state, won't preserve meaningful and consistency of
>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.
>>>>>>>> + *
>>>>>>>> + * @param profiler  performance profiler handle
>>>>>>>> + *
>>>>>>>> + * @retval 0 on success
>>>>>>>> + * @retval <0 on failure
>>>>>>>> + *
>>>>>>>> + * @see odp_profiler_start()
>>>>>>>> + */
>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>>   * @}
>>>>>>>>   */
>>>>>>>>
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mike Holmes
>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>> Linaro.org <http://www.linaro.org/> │ Open source software for
>>>>>>> ARM SoCs
>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>