On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <[email protected]> wrote:

> Hello all,
> 
> I have been experimenting with the idea of leaving ftrace enabled, with
> sched events, on production systems.
> 
> My main concern at the moment is the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably and make the system less
> responsive.
> 
> I have tested with cyclictest on the mainline kernel, and noticed an
> increase of around 25% in the min and avg latencies.
> 
> Is this expected?

No it is not. And this is weird. I'm seeing it too.

> 
> Some initial investigation into ftrace seems to point at the reservation
> and commit of the events into the ring buffer as the largest sources of
> overhead, while copying the event parameters, including COMM, does not
> seem to have any noticeable effect relative to those costs.

Did you use perf to see where the issues were? If so what perf command
did you use?

When running just:
  perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest          [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux]    [k] __schedule
   3.81%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.93%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve.
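To see where that extra time actually goes, comparing call-graph profiles
of the two runs (sched events off vs. on) should make it more obvious.
A rough sketch, assuming perf was built with DWARF unwinding and tracefs
is mounted at /sys/kernel/tracing (adjust the path if your system still
uses /sys/kernel/debug/tracing):

  # Baseline: sched events disabled
  echo 0 > /sys/kernel/tracing/events/sched/enable
  perf record -g --call-graph dwarf -o perf.off.data \
      ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

  # Same workload with sched events enabled
  echo 1 > /sys/kernel/tracing/events/sched/enable
  perf record -g --call-graph dwarf -o perf.on.data \
      ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

  # Compare the two profiles symbol by symbol
  perf diff perf.off.data perf.on.data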

One thing I also notice is that trace_save_cmdline gets activated when
trace events are enabled. Perhaps I should try to optimize that more.
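A quick way to see how much the cmdline saving contributes is to leave
the sched events enabled but turn off the record-cmd option, which
disables the sched_switch hook that feeds trace_save_cmdline (comms then
show up as "<...>" in the trace). A sketch, again assuming tracefs at
/sys/kernel/tracing:

  cd /sys/kernel/tracing

  # sched events stay enabled
  echo 1 > events/sched/enable

  # Disable the pid->comm caching done via trace_save_cmdline
  echo 0 > options/record-cmd

  cyclictest --smp -p95 -m -s -N -l 100000 -q

  # Restore the default
  echo 1 > options/record-cmd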

> 
> I have run the following test 20 times and thrown away the first
> results:
> 
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
> 
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> For brevity, this is a comparison of one test's results. All other test 
> results show the same ~25% increase.
> 
> On the left side is the run without ftrace sched events; on the right
> side, the run with ftrace sched events enabled.
> 
> CPU   Count    Min    Act    Avg     Max    Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>   0  100000   2339   2936   2841  139478   100000        2900        3182        3566       93056
>   1   66742   2365   3386   2874   93639    66750        2959        3786        3646      154074
>   2   50080   2376   3058   2910  196221    50097        2997        4209        3655       18707
>   3   40076   2394   3461   2931   17914    40091        3006        4417        3750       17159
>   4   33404   2371   3612   2834   15336    33419        2997        3836        3594       23172
>   5   28635   2387   3313   2885   25863    28649        2995        3795        3647        9956
>   6   25058   2384   3428   2968   12162    25071        3051        4366        3719       18151
>   7   22275   2381   2859   2982   10706    22287        3046        5078        3825       10781
> 
> I would be thankful for any advice or comments on this, especially with
> the goal of lowering the runtime impact on the system as much as
> possible.

But still, we use all sorts of trace events when dealing with real
time, and trace events hardly show up at all on the PREEMPT_RT kernel:
cyclictest shows around 20 microseconds there, and enabling the sched
events hardly touches that. But I see this weird side effect on a
distro kernel too. Maybe there's something weird going on when we have
CONFIG_PREEMPT_VOLUNTARY set :-/
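If you want to check which preemption model your kernels were built
with, something like this works on most setups (assuming /boot/config-*
is installed or /proc/config.gz is enabled):

  grep -E '^CONFIG_PREEMPT' /boot/config-$(uname -r)
  # or, if the kernel exposes its own config:
  zgrep -E '^CONFIG_PREEMPT' /proc/config.gz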

I'll investigate further.

Thanks for the report.

-- Steve
