Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.  
> 
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
> 
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work, I
just want you to work more with everyone to come up with ideas that can
help everyone as a whole. That's how "open source communities" is suppose
to work ;-)

The histogram and synthetic events can use some more clean ups. There's a
lot of places that can be improved in that code. But I feel the ideas
behind that code is sound. It's just getting the implementation to be a bit
more efficient.

> 
> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but 
> not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick
and thought that was "close enough" ;-)

-- Steve




Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your
histograms - the quantiles algorithm I've had for years is janky, I've been
meaning to rip that out, I'd love to take a look at your code for that. And
having an on/off switch is a good idea, I'll try to add that at some point.
Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to
init_wait() and finish_wait() was really only a starting point. Figuring out
the right places to measure is where I'd like to be investing my time in this
area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
all code uses that, init_wait() got me better coverage. But I've already seen
that that misses things, too, there's more work to be done.

random thought: might try adding a warning in schedule() any time it's called
and codetag_time_stats_start() hasn't been called, that'll be a starting
point...



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet  wrote:  
> > > 
> > > So this looks like it's gotten better since I last looked, but it's still 
> > > not
> > > there yet.
> > > 
> > > Part of the problem is that the tracepoints themselves are in the wrong 
> > > place:
> > > your end event is when a task is woken up, but that means spurious 
> > > wakeups will  
> > 
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.  
> 
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
> 
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random, 
> potentially
> spurious wakeup.

Ideally this would be better if we could pass the stack trace from one
event to the next, but that wouldn't be too hard to implement. It just
needs to be done.

But anyway:

 # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// created an event on prepare_to_wait_event as that's usually called just
// before wait event.

 # sqlhist -e -n wait_sched 'select start.common_pid as 
pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start 
join sched_switch as end on start.common_pid = end.prev_pid where 
end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

 # sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - 
start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as 
end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_event to the sched_switch where the task gets
// scheduled back in. But we are only going to care if the delta between
// the prepare_to_wait_event and the schedule is less that 50us. This is a
// hack to just care about where a prepare_to_wait_event was done just before
// scheduling out.

 # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > 
/sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we are going to look at the deltas that the task was sleeping for an
// event. But this just gives pids and deltas.

 # echo 'hist:keys=pid,stacktrace if delta < 50' >> 
/sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule in, because the
// sched_switch can only give the stack trace of when a task schedules out.
// Again, this is somewhat a hack.

 # cat /sys/kernel/tracing/events/synthetic/wake_sched/hist
# event histogram
#
# trigger info: 
hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 
[active]
#

{ pid:   2114, delta2: ~ 10-19 } hitcount:  1
{ pid:   1389, delta2: ~ 160-169 } hitcount:  1
{ pid:   1389, delta2: ~ 660-669 } hitcount:  1
{ pid:   1389, delta2: ~ 1020-1029 } hitcount:  1
{ pid:   1189, delta2: ~ 500020-500029 } hitcount:  1
{ pid:   1189, delta2: ~ 500030-500039 } hitcount:  1
{ pid:   1195, delta2: ~ 500030-500039 } hitcount:  2
{ pid:   1189, delta2: ~ 500040-500049 } hitcount: 10
{ pid:   1193, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1197, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1195, delta2: ~ 500040-500049 } hitcount:  9
{ pid:   1190, delta2: ~ 500050-500059 } hitcount: 55
{ pid:   1197, delta2: ~ 500050-500059 } hitcount: 51
{ pid:   1191, delta2: ~ 500050-500059 } hitcount: 61
{ pid:   1198, delta2: ~ 500050-500059 } hitcount: 56
{ pid:   1195, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1192, delta2: ~ 500050-500059 } hitcount: 54
{ pid:   1194, delta2: ~ 500050-500059 } hitcount: 50
{ pid:   1196, delta2: ~ 500050-500059 } hitcount: 57
{ pid:   1189, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1193, delta2: ~ 500050-500059 } hitcount: 52
{ pid:   1194, delta2: ~ 500060-500069 } hitcount: 12
{ pid:   1191, delta2: ~ 500060-500069 } hitcount:  2
{ pid:   1190, delta2: ~ 500060-500069 } hitcount:  7
{ pid:   1198, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   1193, delta2: ~ 500060-500069 } hitcount:  6
{ pid:   1196, delta2: ~ 500060-500069 } hitcount:  5
{ pid:   1192, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   1197, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet  wrote:
> > 
> > So this looks like it's gotten better since I last looked, but it's still 
> > not
> > there yet.
> > 
> > Part of the problem is that the tracepoints themselves are in the wrong 
> > place:
> > your end event is when a task is woken up, but that means spurious wakeups 
> > will
> 
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random, potentially
spurious wakeup.


> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though!

That means I have a ton more output to sort through, and the data is far more
expensive to collect.

I don't know why it's what people go to first - see the page_owner stuff - but
that doesn't get used much either because the output is _really hard to sort
through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.


> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
> 
> > 
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
> 
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
> 
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the both
of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability.

That means, in a best case scenario, always-on data collection that I can just
look at, and it'll already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet  wrote:
> 
> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
> 
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups 
> will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
> 
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

 # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > 
/sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: 
hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048
 [active]
#

{ comm: migration/2   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/5   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/1   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/7   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/0   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x595/0x5be
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/4   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: rtkit-daemon  , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 preempt_schedule_common+0x2d/0x70
 preempt_schedule_thunk+0x16/0x18
 _raw_spin_unlock_irq+0x2e/0x40
 eventfd_write+0xc8/0x290
 vfs_write+0xc0/0x2a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan  wrote:
> 
> > From: Kent Overstreet 
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> 
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of the
total effort.

Having done so once, why not annotate that in the source code?

> For function length you could just do something like this:
> 
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function   HitTimeAvg   
>   s^2
>      ------   
>   ---
>   __skb_wait_for_more_packets  10.000 us0.000 us  
>   0.000 us
>   Function   HitTimeAvg   
>   s^2
>      ------   
>   ---
>   __skb_wait_for_more_packets  174.813 us   74.813 us 
>   0.000 us
>   Function   HitTimeAvg   
>   s^2
>      ------   
>   ---
>   Function   HitTimeAvg   
>   s^2
>      ------   
>   ---
> 
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
> 
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
> 
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
>   if not already installed on your distro.
> 
>  # sqlhist -e -n wakeup_lat 'select end.next_comm as 
> comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as 
> delta from sched_waking as start join sched_switch as end on start.pid = 
> end.next_pid where start.prio < 100'
> 
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing performance
work, but I expect that to be far too heavy to enable on a live server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt  wrote:

>  # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > 
> /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist (but I haven't implemented
"buckets=10" yet because that's a new feature. But for now, let's do log2):

 # sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: 
hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ comm: migration/4   , prio:  0, 
delta: ~ 2^5  } hitcount:  1
{ comm: migration/0   , prio:  0, 
delta: ~ 2^4  } hitcount:  2
{ comm: rtkit-daemon  , prio:  0, 
delta: ~ 2^7  } hitcount:  2
{ comm: rtkit-daemon  , prio:  0, 
delta: ~ 2^6  } hitcount:  4
{ comm: migration/0   , prio:  0, 
delta: ~ 2^5  } hitcount:  8
{ comm: migration/4   , prio:  0, 
delta: ~ 2^4  } hitcount:  9
{ comm: migration/2   , prio:  0, 
delta: ~ 2^4  } hitcount: 10
{ comm: migration/5   , prio:  0, 
delta: ~ 2^4  } hitcount: 10
{ comm: migration/7   , prio:  0, 
delta: ~ 2^4  } hitcount: 10
{ comm: migration/1   , prio:  0, 
delta: ~ 2^4  } hitcount: 10
{ comm: migration/6   , prio:  0, 
delta: ~ 2^4  } hitcount: 10

Totals:
Hits: 76
Entries: 11
Dropped: 0


-- Steve



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan  wrote:

> From: Kent Overstreet 
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> 
> Stastistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Stastics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
> 
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:  17
> rate:   0/sec
> frequency:  2 sec
> avg duration:   10 us
> max duration:   232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
> 
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:  3
> rate:   0/sec
> frequency:  4 sec
> avg duration:   4 sec
> max duration:   4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 
> 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 
> 5360836048 5360836048
> 
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:  10
> rate:   1/sec
> frequency:  859 ms
> avg duration:   472 ms
> max duration:   30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 
> 17217 17217 17217 17217

For function length you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function   HitTimeAvg 
s^2
     ------ 
---
  __skb_wait_for_more_packets  10.000 us0.000 us
0.000 us
  Function   HitTimeAvg 
s^2
     ------ 
---
  __skb_wait_for_more_packets  174.813 us   74.813 us   
0.000 us
  Function   HitTimeAvg 
s^2
     ------ 
---
  Function   HitTimeAvg 
s^2
     ------ 
---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (makes not having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
  if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as 
comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as 
delta from sched_waking as start join sched_switch as end on start.pid = 
end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as comm field), the pid of
woken task and the time delta in microseconds between the two events.

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > 
/sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: 
hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048
 [active]
#

{ comm: migration/5   , prio:  0, 
delta: ~ 10-19 } hitcount:  1
{ comm: migration/2   , prio:  0, 
delta: ~ 10-19 } hitcount:  1
{ comm: sleep , prio: 79, 
delta: ~ 10-19 } hitcount:  1
{ comm: migration/7   , prio:  0, 
delta: ~ 10-19 } hitcount:  

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet 
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> > 
> > Stastistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> > 
> > Stastics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> > 
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
> 
> How can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've
never succesfully used them for debugging latency issues, or known anyone who
has.

And an eBPF script to do everything this does wouldn't be simple at all.
Honesly, the time stats stuff looks _far_ simpler to me than anything involving
tracing - and with tracing you have to correlate the start and end events after
the fact.



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Peter Zijlstra
On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet 
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);
> 
> Stastistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Stastics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:

How can't you do this with a simple eBPF script on top of
trace_sched_stat_* and friends?



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-08-31 Thread Suren Baghdasaryan
On Tue, Aug 30, 2022 at 6:53 PM Randy Dunlap  wrote:
>
>
>
> On 8/30/22 14:49, Suren Baghdasaryan wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index b7d03afbc808..b0f86643b8f0 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1728,6 +1728,14 @@ config LATENCYTOP
> > Enable this option if you want to use the LatencyTOP tool
> > to find out which userspace is blocking on what kernel operations.
> >
> > +config CODETAG_TIME_STATS
> > + bool "Code tagging based latency measuring"
> > + depends on DEBUG_FS
> > + select TIME_STATS
> > + select CODE_TAGGING
> > + help
> > +   Enabling this option makes latency statistics available in debugfs
>
> Missing period at the end of the sentence.

Ack.

>
> --
> ~Randy



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-08-30 Thread Randy Dunlap



On 8/30/22 14:49, Suren Baghdasaryan wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index b7d03afbc808..b0f86643b8f0 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1728,6 +1728,14 @@ config LATENCYTOP
> Enable this option if you want to use the LatencyTOP tool
> to find out which userspace is blocking on what kernel operations.
>  
> +config CODETAG_TIME_STATS
> + bool "Code tagging based latency measuring"
> + depends on DEBUG_FS
> + select TIME_STATS
> + select CODE_TAGGING
> + help
> +   Enabling this option makes latency statistics available in debugfs

Missing period at the end of the sentence.

-- 
~Randy