-------- Forwarded Message --------
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree
Date: Thu, 11 May 2017 14:53:48 +0300
From: Alexey Budankov <alexey.budan...@linux.intel.com>
Organization: Intel Corp.
To: davi...@google.com
CC: alexey.budan...@linux.intel.com

On 02.05.2017 23:59, Budankov, Alexey wrote:


On Wed, Apr 26, 2017 at 3:34 AM, Budankov, Alexey <alexey.budan...@intel.com> 
wrote:
Hi David,

I would like to take over on the patches development relying on your help with 
reviews.

Sounds good.

Hi,

Sorry for the late reply; Russian holidays combined with vacations.

So, I see that the event_sched_out() function (4.11.0-rc6+), in addition to disabling an active (PERF_EVENT_STATE_ACTIVE) event in HW, also updates the tstamp fields of inactive (PERF_EVENT_STATE_INACTIVE) events assigned to "the other" cpus (different from the one executing the function):

static void
event_sched_out(struct perf_event *event,
                  struct perf_cpu_context *cpuctx,
                  struct perf_event_context *ctx)
{
        u64 tstamp = perf_event_time(event);
        u64 delta;

        WARN_ON_ONCE(event->ctx != ctx);
        lockdep_assert_held(&ctx->lock);

        /*
         * An event which could not be activated because of
         * filter mismatch still needs to have its timings
         * maintained, otherwise bogus information is return
         * via read() for time_enabled, time_running:
         */
->   if (event->state == PERF_EVENT_STATE_INACTIVE &&
->       !event_filter_match(event)) {
->           delta = tstamp - event->tstamp_stopped;
->           event->tstamp_running += delta;
->           event->tstamp_stopped = tstamp;
->   }
->
->   if (event->state != PERF_EVENT_STATE_ACTIVE)
->           return;

I suggest moving this updating work into the perf_event_task_tick() callback. It runs per-cpu at a 1ms interval by default, so inactive events would get their tstamp fields updated off this critical path:

event_sched_out()
    group_sched_out()
        ctx_sched_out()
            perf_rotate_context()
                perf_mux_hrtimer_handler()

This change would shorten the performance-critical processing to active events only, since the number of active events is always limited by hardware.
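The deferred work itself is just the delta accumulation from the highlighted branch of event_sched_out(). Here is a minimal userspace model of that bookkeeping; the struct and function names are illustrative stand-ins, not kernel code:

```c
#include <assert.h>

/* Userspace model: mirrors the tstamp_running/tstamp_stopped fields
 * of struct perf_event that the highlighted branch maintains. */
struct model_event {
        unsigned long long tstamp_running; /* accumulated enabled time */
        unsigned long long tstamp_stopped; /* last time accounting stopped */
};

/* Advance an inactive event's timings to 'now', exactly as the
 * highlighted branch of event_sched_out() does.  Under the proposal
 * this would be called from the per-cpu tick instead of from the
 * context-switch path. */
static void update_inactive_tstamps(struct model_event *event,
                                    unsigned long long now)
{
        unsigned long long delta = now - event->tstamp_stopped;

        event->tstamp_running += delta;
        event->tstamp_stopped = now;
}
```

Because the update only needs a "current time" reference point, running it at most once per tick yields the same time_enabled/time_running totals as running it on every context switch.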


Could you provide me with the cumulative patch set to expedite the ramp up?

This RFC is my latest version. I did not find a good solution for handling failure of PMUs that share contexts, nor for activating/inactivating them.

Some things to keep in mind when dealing with task-contexts are:
   1. The number of PMUs is large and growing, iterating over all PMUs may be 
expensive (see https://lkml.org/lkml/2017/1/18/859 ).
   2. event_filter_match in this RFC is only used because I did not find a better way to 
filter out events with the rb-tree. It would be nice if we didn't have to check 
event->cpu != -1 && event->cpu == smp_processor_id() and the cgroup stuff for 
every event in task contexts.

Checking an event's cpu affinity can then be avoided, at least on the critical path mentioned above.

   3. I used the inactive events list in this RFC as a cheaper alternative to 
threading the rb-tree, but it has the problem that events removed due to a 
conflict are placed at the end of the list even if they didn't run. I cannot 
recall if that ever happens. Using this list also causes problem (2.); maybe 
threading the tree is a better alternative?

The simplest rb-tree key for the suggested change could be {event->cpu, event->id}, so that events are organized into per-cpu sub-trees for fast search and enumeration. All software events could be grouped under event->cpu == -1.
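A key like this reduces to an ordinary two-field comparator. The sketch below shows the ordering in plain C (types and names are illustrative, not the kernel's); with cpu compared first, each cpu's events form one contiguous sub-tree, and software events (cpu == -1) sort before all hardware cpus:

```c
/* Illustrative key for the suggested {event->cpu, event->id} ordering. */
struct key_event {
        int cpu;                 /* -1 groups software events first */
        unsigned long long id;
};

/* Classic three-way comparator, the shape an rb-tree insert/lookup
 * routine would use to place events into per-cpu sub-trees. */
static int event_key_cmp(const struct key_event *a,
                         const struct key_event *b)
{
        if (a->cpu != b->cpu)
                return a->cpu < b->cpu ? -1 : 1;
        if (a->id != b->id)
                return a->id < b->id ? -1 : 1;
        return 0;
}
```

With this ordering, "all events for cpu N" is a single in-order range, so per-cpu enumeration never has to visit other cpus' events.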

   4. Making the key in task-events {PMU,CPU,last_time_scheduled} (as 
opposed to {CPU,last_time_scheduled} in the RFC) may simplify sched-in by 
making it easy to iterate over all events of the same PMU at once, simplifying 
the activation/inactivation of the PMU and making it simple to move to the next 
PMU on pmu::add errors. The problem with this approach is finding only the PMUs 
with inactive events without traversing a list of all PMUs. Maybe a per-context 
list of active PMUs would help (see 1.).
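For comparison with the key above, the {PMU,CPU,last_time_scheduled} variant from point 4 can be sketched the same way (again with illustrative names; an integer stands in for a struct pmu pointer). Ordering by PMU first keeps all of a PMU's events in one contiguous range, which is what lets sched-in walk one PMU at a time and skip ahead on pmu::add failure:

```c
/* Illustrative key for the {PMU, CPU, last_time_scheduled} ordering
 * discussed in point 4.  pmu_id is a stand-in for a struct pmu *. */
struct pmu_key {
        int pmu_id;
        int cpu;
        unsigned long long last_time_scheduled;
};

/* PMU is the most significant field, so all events of one PMU are
 * adjacent in rb-tree order; within a PMU, events are grouped by cpu
 * and then by how long ago they last ran. */
static int pmu_key_cmp(const struct pmu_key *a, const struct pmu_key *b)
{
        if (a->pmu_id != b->pmu_id)
                return a->pmu_id < b->pmu_id ? -1 : 1;
        if (a->cpu != b->cpu)
                return a->cpu < b->cpu ? -1 : 1;
        if (a->last_time_scheduled != b->last_time_scheduled)
                return a->last_time_scheduled < b->last_time_scheduled ? -1 : 1;
        return 0;
}
```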

cpu-contexts are much simpler, and I think they work well with what the RFC does 
(they are per-pmu already).

This thread has Peter and Mark's original discussion of the rb-tree 
(https://patchwork.kernel.org/patch/9176121/).

Thanks,
David


What do you think?

Thanks,
Alexey
