Re: [PATCH v2] net: Add trace events for all receive exit points

2018-11-12 Thread Steven Rostedt
On Mon, 12 Nov 2018 15:46:53 -0500 (EST)
Mathieu Desnoyers  wrote:

> I also notice that in two cases, a "gro_result_t" is implicitly cast
> to "int". I usually frown upon this kind of stuff, because it's asking
> for trouble if gro_result_t is typedef'd to something other than "int"
> in the future.
> 
> I would recommend going for two templates, one which takes a "int"
> ret parameter, and the other a "gro_result_t" ret parameter.
> 
> Or am I being too cautious?

That's more of a question for the netdev maintainers. If they think
casting gro_result_t to int is fine, then I'm fine. If it breaks in the
future, they need to deal with it, I don't ;-)

The downside of two templates is that the templates are the bloated
part of the trace event (DEFINE_EVENT()s are lightweight). They can
add a couple of K to the memory footprint.

-- Steve


Re: [PATCH v2] net: Add trace events for all receive exit points

2018-11-12 Thread Steven Rostedt
On Mon, 12 Nov 2018 15:20:55 -0500 (EST)
Mathieu Desnoyers  wrote:

> 
> Hrm, looking at this again, I notice that there is a single DEFINE_EVENT
> using net_dev_template_simple.
> 
> We could simply turn netif_receive_skb_list_exit into a TRACE_EVENT(),
> remove the net_dev_template_simple, and rename the net_dev_template_return
> to net_dev_template?

This too is only cosmetic and doesn't affect the code at all, because a
TRACE_EVENT() is really just:

#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DECLARE_EVENT_CLASS(name,  \
 PARAMS(proto),\
 PARAMS(args), \
 PARAMS(tstruct),  \
 PARAMS(assign),   \
 PARAMS(print));   \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
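
So folding the single-use class into a TRACE_EVENT() expands to exactly
the same code. A sketch of Mathieu's suggestion, reusing the
net_dev_template_simple body from the v2 patch:

	TRACE_EVENT(netif_receive_skb_list_exit,

		TP_PROTO(struct sk_buff *skb),

		TP_ARGS(skb),

		TP_STRUCT__entry(
			__field(void *,	skbaddr)
		),

		TP_fast_assign(
			__entry->skbaddr = skb;
		),

		TP_printk("skbaddr=%p", __entry->skbaddr)
	);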

-- Steve

> 
> It's pretty clear from the prototype that it expects a "ret" argument,
> so I don't see the need to also state it in the template name.
> 
>


Re: [PATCH v2] net: Add trace events for all receive exit points

2018-11-12 Thread Steven Rostedt
On Mon, 12 Nov 2018 14:44:05 -0500
Geneviève Bastien  wrote:

> Trace events are already present for the receive entry points, to indicate
> how the reception entered the stack.
> 
> This patch adds the corresponding exit trace events that will bound the
> reception such that all events occurring between the entry and the exit
> can be considered as part of the reception context. This greatly helps
> for dependency and root cause analyses.
> 
> Without this, it is impossible to determine whether a sched_wakeup
> event following a netif_receive_skb event is the result of the packet
> reception or a simple coincidence after further processing by the
> thread.
> 
> Signed-off-by: Geneviève Bastien 
> CC: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Ingo Molnar 
> CC: David S. Miller 
> ---
> Changes in v2:
>   - Add the return value to tracepoints where applicable
>   - Verify if tracepoint is enabled before walking list in
> netif_receive_skb_list
> ---
>  include/trace/events/net.h | 78 ++
>  net/core/dev.c | 38 ---
>  2 files changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> index 00aa72ce0e7c..cff1a7b9d0bb 100644
> --- a/include/trace/events/net.h
> +++ b/include/trace/events/net.h
> @@ -117,6 +117,42 @@ DECLARE_EVENT_CLASS(net_dev_template,
>   __get_str(name), __entry->skbaddr, __entry->len)
>  )
>  
> +DECLARE_EVENT_CLASS(net_dev_template_return,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret),
> +
> + TP_STRUCT__entry(
> + __field(void *, skbaddr)
> + __field(int,ret)
> + ),
> +
> + TP_fast_assign(
> + __entry->skbaddr = skb;
> + __entry->ret = ret;
> + ),
> +
> + TP_printk("skbaddr=%p ret=%d", __entry->skbaddr, __entry->ret)
> +)
> +
> +DECLARE_EVENT_CLASS(net_dev_template_simple,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb),
> +
> + TP_STRUCT__entry(
> + __field(void *, skbaddr)
> + ),
> +
> + TP_fast_assign(
> + __entry->skbaddr = skb;
> + ),
> +
> + TP_printk("skbaddr=%p", __entry->skbaddr)
> +)
> +
>  DEFINE_EVENT(net_dev_template, net_dev_queue,
>  
>   TP_PROTO(struct sk_buff *skb),
> @@ -244,6 +280,48 @@ DEFINE_EVENT(net_dev_rx_verbose_template, 
> netif_rx_ni_entry,
>   TP_ARGS(skb)
>  );
>  
> +DEFINE_EVENT(net_dev_template_return, napi_gro_frags_exit,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret)
> +);
> +
> +DEFINE_EVENT(net_dev_template_return, napi_gro_receive_exit,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret)
> +);
> +
> +DEFINE_EVENT(net_dev_template_return, netif_receive_skb_exit,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_list_exit,

One small nit, and I don't care enough to ask you to fix it, but others
might care. We probably should not intermix net_dev_template_simple
events with net_dev_template_return events.

But like I said, I don't really care, it's more cosmetic than anything
else.

For the tracing side:

Reviewed-by: Steven Rostedt (VMware) 

-- Steve


> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_return, netif_rx_exit,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret)
> +);
> +
> +DEFINE_EVENT(net_dev_template_return, netif_rx_ni_exit,
> +
> + TP_PROTO(struct sk_buff *skb, int ret),
> +
> + TP_ARGS(skb, ret)
> +);
> +
>  #endif /* _TRACE_NET_H */
>  
>


Re: [PATCH] net: Add trace events for all receive exit points

2018-11-09 Thread Steven Rostedt
On Thu,  8 Nov 2018 14:56:48 -0500
Geneviève Bastien  wrote:

> Trace events are already present for the receive entry points, to indicate
> how the reception entered the stack.
> 
> This patch adds the corresponding exit trace events that will bound the
> reception such that all events occurring between the entry and the exit
> can be considered as part of the reception context. This greatly helps
> for dependency and root cause analyses.
> 
> Without this, it is impossible to determine whether a sched_wakeup
> event following a netif_receive_skb event is the result of the packet
> reception or a simple coincidence after further processing by the
> thread.
> 
> Signed-off-by: Geneviève Bastien 
> CC: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Ingo Molnar 
> CC: David S. Miller 
> ---
>  include/trace/events/net.h | 59 ++
>  net/core/dev.c | 30 ---
>  2 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> index 00aa72ce0e7c..318307511018 100644
> --- a/include/trace/events/net.h
> +++ b/include/trace/events/net.h
> @@ -117,6 +117,23 @@ DECLARE_EVENT_CLASS(net_dev_template,
>   __get_str(name), __entry->skbaddr, __entry->len)
>  )
>  
> +DECLARE_EVENT_CLASS(net_dev_template_simple,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb),
> +
> + TP_STRUCT__entry(
> + __field(void *, skbaddr)
> + ),
> +
> + TP_fast_assign(
> + __entry->skbaddr = skb;
> + ),
> +
> + TP_printk("skbaddr=%p", __entry->skbaddr)
> +)
> +
>  DEFINE_EVENT(net_dev_template, net_dev_queue,
>  
>   TP_PROTO(struct sk_buff *skb),
> @@ -244,6 +261,48 @@ DEFINE_EVENT(net_dev_rx_verbose_template, 
> netif_rx_ni_entry,
>   TP_ARGS(skb)
>  );
>  
> +DEFINE_EVENT(net_dev_template_simple, napi_gro_frags_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, napi_gro_receive_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_list_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, netif_rx_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template_simple, netif_rx_ni_exit,
> +
> + TP_PROTO(struct sk_buff *skb),
> +
> + TP_ARGS(skb)
> +);
> +
>  #endif /* _TRACE_NET_H */
>  
>  /* This part must be outside protection */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ffcbdd55fa9..e670ca27e829 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4520,9 +4520,14 @@ static int netif_rx_internal(struct sk_buff *skb)
>  
>  int netif_rx(struct sk_buff *skb)
>  {
> + int ret;
> +
>   trace_netif_rx_entry(skb);
>  
> - return netif_rx_internal(skb);
> + ret = netif_rx_internal(skb);
> + trace_netif_rx_exit(skb);
> +
> + return ret;

Pretty much all of these trace events have a "ret" passed to them, so
why not record the ret in the trace event as well?

trace_netif_rx_exit(skb, ret);
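
Applied to netif_rx() above, that would look like the sketch below
(which is what v2 of the patch ends up doing):

	int netif_rx(struct sk_buff *skb)
	{
		int ret;

		trace_netif_rx_entry(skb);

		ret = netif_rx_internal(skb);
		/* record the return value alongside the skb */
		trace_netif_rx_exit(skb, ret);

		return ret;
	}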

>  }
>  EXPORT_SYMBOL(netif_rx);
>  
> @@ -4537,6 +4542,7 @@ int netif_rx_ni(struct sk_buff *skb)
>   if (local_softirq_pending())
>   do_softirq();
>   preempt_enable();
> + trace_netif_rx_ni_exit(skb);
>  
>   return err;
>  }
> @@ -5222,9 +5228,14 @@ static void netif_receive_skb_list_internal(struct 
> list_head *head)
>   */
>  int netif_receive_skb(struct sk_buff *skb)
>  {
> + int ret;
> +
>   trace_netif_receive_skb_entry(skb);
>  
> - return netif_receive_skb_internal(skb);
> + ret = netif_receive_skb_internal(skb);
> + trace_netif_receive_skb_exit(skb);
> +
> + return ret;
>  }
>  EXPORT_SYMBOL(netif_receive_skb);
>  
> @@ -5247,6 +5258,8 @@ void netif_receive_skb_list(struct list_head *head)
>   list_for_each_entry(skb, head, list)
>   trace_netif_receive_skb_list_entry(skb);
>   netif_receive_skb_list_internal(head);
> + list_for_each_entry(skb, head, list)
> + trace_netif_receive_skb_list_exit(skb);

This needs:

	if (trace_netif_receive_skb_list_exit_enabled()) {
		list_for_each_entry(skb, head, list)
			trace_netif_receive_skb_list_exit(skb);
	}

so that the exit list walk is only done when the tracepoint is actually
enabled.

Re: [v8, bpf-next, 4/9] net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint

2018-05-24 Thread Steven Rostedt
On Thu, 24 May 2018 16:28:39 -0700
Alexei Starovoitov  wrote:

> Ohh. I didn't realize that networking wireless doesn't fall under netdev.
> I thought wireless folks are silent because they are embarrassed
> by a function with 17 arguments.

Please let's refrain from the demeaning comments.

I agree with your argument, but not the tone.

-- Steve


Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Steven Rostedt
On Mon, 30 Apr 2018 09:14:04 -0700
Ben Greear  wrote:

> >> As part of VMware's performance testing with the Linux 4.15 kernel,
> >> we identified CPU cost and throughput regressions when comparing to
> >> the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
> >> send tests when using small message sizes. The regressions are
> >> significant (up to 3x) and were tracked down to be a side effect of Eric
> >> Dumazet's RB tree changes that went into the Linux 4.15 kernel.
> >> Further investigation showed our use of the TCP_NODELAY flag in
> >> conjunction with Eric's change caused the regressions to show and
> >> simply disabling TCP_NODELAY brought performance back to normal.
> >> Eric's change also resulted in significant improvements in our
> >> TCP_RR test cases.
> >>
> >>
> >>
> >> Based on these results, our theory is that Eric's change made the
> >> system overall faster (reduced latency) but as a side effect less
> >> aggregation is happening (with TCP_NODELAY) and that results in lower
> >> throughput. Previously even though TCP_NODELAY was set, system was
> >> slower and we still got some benefit of aggregation. Aggregation
> >> helps in better efficiency and higher throughput although it can
> >> increase the latency. If you are seeing a regression in your
> >> application throughput after this change, using TCP_NODELAY might
> >> help bring performance back however that might increase latency.  
> 
> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Yes, thank you for catching that.

-- Steve



Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-27 Thread Steven Rostedt

We'd like this email archived on the netdev list, but since netdev is
notorious for blocking Outlook email as spam, it didn't go through. So
I'm replying here to help get it into the archives.

Thanks!

-- Steve


On Fri, 27 Apr 2018 23:05:46 +
Michael Wenig  wrote:

> As part of VMware's performance testing with the Linux 4.15 kernel,
> we identified CPU cost and throughput regressions when comparing to
> the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
> send tests when using small message sizes. The regressions are
> significant (up to 3x) and were tracked down to be a side effect of Eric
> Dumazet's RB tree changes that went into the Linux 4.15 kernel.
> Further investigation showed our use of the TCP_NODELAY flag in
> conjunction with Eric's change caused the regressions to show and
> simply disabling TCP_NODELAY brought performance back to normal.
> Eric's change also resulted in significant improvements in our
> TCP_RR test cases.
> 
> 
> 
> Based on these results, our theory is that Eric's change made the
> system overall faster (reduced latency) but as a side effect less
> aggregation is happening (with TCP_NODELAY) and that results in lower
> throughput. Previously even though TCP_NODELAY was set, system was
> slower and we still got some benefit of aggregation. Aggregation
> helps in better efficiency and higher throughput although it can
> increase the latency. If you are seeing a regression in your
> application throughput after this change, using TCP_NODELAY might
> help bring performance back however that might increase latency.
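
[ For reference, toggling this from an application is a one-line
  setsockopt(); a minimal sketch, where fd is a connected TCP socket:

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	int off = 0;	/* 0 = allow Nagle aggregation, 1 = TCP_NODELAY */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off)); ]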
> 
> 
> 
> As such, we are not asking for a fix but simply want to document for
> others what we have found.
> 
> 
> 
> Michael Wenig
> 
> Performance Engineering
> 
> VMware, Inc.
> 



Re: [PATCH v8 bpf-next 6/9] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 12:38:48 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/28/18 12:34 PM, Steven Rostedt wrote:
> > On Wed, 28 Mar 2018 12:05:37 -0700
> > Alexei Starovoitov <a...@kernel.org> wrote:
> >  
> >> +++ b/include/linux/tracepoint-defs.h
> >> @@ -35,4 +35,10 @@ struct tracepoint {
> >>struct tracepoint_func __rcu *funcs;
> >>  };
> >>
> >> +struct bpf_raw_event_map {
> >> +  struct tracepoint   *tp;
> >> +  void*bpf_func;
> >> +  u32 num_args;
> >> +} __aligned(32);
> >> +  
> >
> > If you prefer v7, I'm fine with that. For cache issues, I can pull out
> > the funcs from the tracepoint structure like I posted.  
> 
> I very much prefer to land this v8 as-is and optimize later.
> 
> I still have bpfilter/microkernel patches to finish which were
> practically ready two weeks ago and got delayed by this set.

Then by all means, you have my Ack. We can optimize later.

For the whole series... (v7 or v8)

Acked-by: Steven Rostedt (VMware) <rost...@goodmis.org>

Thanks!

-- Steve


Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 15:32:20 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> -#define __DO_TRACE(tp, proto, args, cond, rcucheck)  \
> +#define __DO_TRACE(name, proto, args, cond, rcucheck)
> \
>   do {\
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
> @@ -140,7 +140,7 @@ extern void syscall_unregfunc(void);
>   if (rcucheck)   \
>   rcu_irq_enter_irqson(); \
>   rcu_read_lock_sched_notrace();  \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
> + it_func_ptr = rcu_dereference_sched(__trace_##name##_funcs); \

What we lose in data size, we may make up for in text (which is even
more important). This will remove a dereference in the hot path.

I'll make a few builds and run size on the vmlinux images to see how
this pans out.
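(For example: run "size vmlinux.before vmlinux.after" on the two builds
and compare the text columns.)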

-- Steve


>   if (it_func_ptr) {  \
>   do {\


Re: [PATCH v8 bpf-next 6/9] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 12:05:37 -0700
Alexei Starovoitov  wrote:

> +++ b/include/linux/tracepoint-defs.h
> @@ -35,4 +35,10 @@ struct tracepoint {
>   struct tracepoint_func __rcu *funcs;
>  };
>  
> +struct bpf_raw_event_map {
> + struct tracepoint   *tp;
> + void*bpf_func;
> + u32 num_args;
> +} __aligned(32);
> +

If you prefer v7, I'm fine with that. For cache issues, I can pull out
the funcs from the tracepoint structure like I posted.

Mathieu, your thoughts?

-- Steve


Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 15:22:24 -0400 (EDT)
Mathieu Desnoyers  wrote:

> > > cache hot/cold argument clearly doesn't apply.  
> 
> In the current situation I'm fine with adding this extra field
> to struct tracepoint. However, we should keep in mind to move
> all non-required cache-cold fields to a separate section at
> some point. Clearly just this single field won't make a difference
> due to other fields and padding.

funcs is the only part of the tracepoint structure that needs hot
cache. Thus, instead of trying to keep the tracepoint structure small
for cache reasons, pull funcs out of the tracepoint structure.

What about this patch?

Of course, this just adds another 8 bytes per tracepoint :-/ But it
keeps the functions away from the tracepoint structures. We could even
add a section for them to be together, but I'm not sure that will help
much more than just letting the linker place them, as these function
pointers for the same subsystem will probably be grouped together.
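
Roughly, the section variant would look like this hypothetical sketch
(the macro and section name are made up for illustration):

	/* keep all per-tracepoint funcs pointers together */
	#define DEFINE_TRACE_FUNCS(name)				\
		struct tracepoint_func __rcu *__trace_##name##_funcs	\
		__attribute__((section("__tracepoint_funcs")))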

-- Steve

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 35db8dd48c4c..8d100163e9af 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -32,7 +32,7 @@ struct tracepoint {
struct static_key key;
int (*regfunc)(void);
void (*unregfunc)(void);
-   struct tracepoint_func __rcu *funcs;
+   struct tracepoint_func **funcs;
u32 num_args;
 };
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c92f4adbc0d7..b55282202f71 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -129,7 +129,7 @@ extern void syscall_unregfunc(void);
  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcucheck)\
+#define __DO_TRACE(name, proto, args, cond, rcucheck)  \
do {\
struct tracepoint_func *it_func_ptr;\
void *it_func;  \
@@ -140,7 +140,7 @@ extern void syscall_unregfunc(void);
if (rcucheck)   \
rcu_irq_enter_irqson(); \
rcu_read_lock_sched_notrace();  \
-   it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
+   it_func_ptr = rcu_dereference_sched(__trace_##name##_funcs); \
if (it_func_ptr) {  \
do {\
it_func = (it_func_ptr)->func;  \
@@ -158,7 +158,7 @@ extern void syscall_unregfunc(void);
static inline void trace_##name##_rcuidle(proto)\
{   \
if (static_key_false(&__tracepoint_##name.key)) \
-   __DO_TRACE(&__tracepoint_##name,\
+   __DO_TRACE(name,\
TP_PROTO(data_proto),   \
TP_ARGS(data_args), \
TP_CONDITION(cond), 1); \
@@ -181,10 +181,11 @@ extern void syscall_unregfunc(void);
  */
 #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
extern struct tracepoint __tracepoint_##name;   \
+   extern struct tracepoint_func __rcu *__trace_##name##_funcs;\
static inline void trace_##name(proto)  \
{   \
if (static_key_false(&__tracepoint_##name.key)) \
-   __DO_TRACE(&__tracepoint_##name,\
+   __DO_TRACE(name,\
TP_PROTO(data_proto),   \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
@@ -233,9 +234,11 @@ extern void syscall_unregfunc(void);
 #define DEFINE_TRACE_FN(name, reg, unreg, num_args) \
static const char __tpstrtab_##name[]\
__attribute__((section("__tracepoints_strings"))) = #name;   \
+   struct tracepoint_func __rcu *__trace_##name##_funcs;\
struct tracepoint __tracepoint_##name\
__attribute__((section("__tracepoints"))) =  \
-   { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, 

Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 11:19:34 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/28/18 11:10 AM, Steven Rostedt wrote:
> > On Wed, 28 Mar 2018 11:03:24 -0700
> > Alexei Starovoitov <a...@fb.com> wrote:
> >  
> >> I can live with this overhead if Mathieu insists,
> >> but I prefer to keep it in 'struct tracepoint'.
> >>
> >> Thoughts?  
> >
> > I'm fine with keeping it as is. We could probably use it for future
> > enhancements in perf and ftrace.
> >
> > Perhaps, we should just add a:
> >
> > #ifdef CONFIG_BPF_EVENTS
> >
> > Around the use cases of num_args.  
> 
> it sounds like a good idea, but implementation wise
> it will be ifdef CONFIG_BPF_EVENTS around u32 num_args;
> in struct tracepoint and similar double definition of
> DEFINE_TRACE_FN. One that uses num_args to init
> struct tracepoint and one that doesn't ?
> Feels like serious uglification of already macros heavy code.
> Also what it will address?

32bit bloat ;-)

But I agree, it's not worth uglifying it.

-- Steve

> cache hot/cold argument clearly doesn't apply.




Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 11:03:24 -0700
Alexei Starovoitov  wrote:

> I can live with this overhead if Mathieu insists,
> but I prefer to keep it in 'struct tracepoint'.
> 
> Thoughts?

I'm fine with keeping it as is. We could probably use it for future
enhancements in perf and ftrace.

Perhaps, we should just add a:

#ifdef CONFIG_BPF_EVENTS

Around the use cases of num_args.

-- Steve


Re: [PATCH v7 bpf-next 07/10] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-28 Thread Steven Rostedt
On Tue, 27 Mar 2018 19:11:02 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> From: Alexei Starovoitov <a...@kernel.org>
> 
> Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
> kernel internal arguments of the tracepoints in their raw form.
> 
> From the bpf program's point of view, the access to the arguments looks like:
> struct bpf_raw_tracepoint_args {
>__u64 args[0];
> };
> 
> int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
> {
>   // program can read args[N] where N depends on tracepoint
>   // and statically verified at program load+attach time
> }
> 
> kprobe+bpf infrastructure allows programs access function arguments.
> This feature allows programs access raw tracepoint arguments.
> 
> Similar to proposed 'dynamic ftrace events' there are no abi guarantees
> to what the tracepoints arguments are and what their meaning is.
> The program needs to type cast args properly and use bpf_probe_read()
> helper to access struct fields when argument is a pointer.
> 
> For every tracepoint __bpf_trace_##call function is prepared.
> In assembler it looks like:
> (gdb) disassemble __bpf_trace_xdp_exception
> Dump of assembler code for function __bpf_trace_xdp_exception:
>0x81132080 <+0>: mov%ecx,%ecx
>0x81132082 <+2>: jmpq   0x811231f0 
> 
> where
> 
> TRACE_EVENT(xdp_exception,
> TP_PROTO(const struct net_device *dev,
>  const struct bpf_prog *xdp, u32 act),
> 
> The above assembler snippet is casting 32-bit 'act' field into 'u64'
> to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
> All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
> and in total this approach adds 7k bytes to .text.
> 
> This approach gives the lowest possible overhead
> while calling trace_xdp_exception() from kernel C code and
> transitioning into bpf land.
> Since tracepoint+bpf are used at speeds of 1M+ events per second
> this is valuable optimization.
> 
> The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
> that returns anon_inode FD of 'bpf-raw-tracepoint' object.
> 
> The user space looks like:
> // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
> prog_fd = bpf_prog_load(...);
> // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
> raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
> 
> Ctrl-C of tracing daemon or cmdline tool that uses this feature
> will automatically detach bpf program, unload it and
> unregister tracepoint probe.
> 
> On the kernel side the __bpf_raw_tp_map section of pointers to
> tracepoint definition and to __bpf_trace_*() probe function is used
> to find a tracepoint with "xdp_exception" name and
> corresponding __bpf_trace_xdp_exception() probe function
> which are passed to tracepoint_probe_register() to connect probe
> with tracepoint.
> 
> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
> tracepoint mechanisms. perf_event_open() can be used in parallel
> on the same tracepoint.
> Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
> Each with its own bpf program. The kernel will execute
> all tracepoint probes and all attached bpf programs.
> 
> In the future bpf_raw_tracepoints can be extended with
> query/introspection logic.
> 
> __bpf_raw_tp_map section logic was contributed by Steven Rostedt
> 
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>
> Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>
> ---

Just an FYI, I applied all the patches up to and including this one
(made sure BPF_EVENTS was enabled in my config this time), built and
booted the kernel and ran a bunch of tests (not my full suite, but
enough).

It didn't affect any other tracing features that I can see.

-- Steve



Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 10:10:34 -0700
Alexei Starovoitov  wrote:


> > and have:
> >
> > u64 tp_offset = (u64)tp - (u64)_sdata;
> >
> > if (WARN_ON(tp_offset > UINT_MAX))
> > return -EINVAL;
> >
> >  btp->tp_offset = (u32)tp_offset;  
> 
> above math has to be build time constant, so warn_on likely
> won't work.

Right, it would require a BUILD_BUG_ON.

> imo the whole thing is too fragile and obscure.
> I suggest to compress this 8 bytes * num_of_tracepoints later.
> Especially would be good to do it in one way for
> bpf_raw_event_map, ftrace and other places.

Fair enough. We can defer this shrinkage to another time. I only
suggested it here because of your concern about the added bloat.

-- Steve


Re: [PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-28 Thread Steven Rostedt
On Wed, 28 Mar 2018 09:43:56 -0700
Alexei Starovoitov  wrote:
> >
> > Given that only eBPF needs this parameter count, we can move
> > it to the struct bpf_raw_event_map newly introduced by Steven,
> > right ? This would reduce bloat of struct tracepoint. For instance,
> > we don't need to keep this count around when eBPF is configured
> > out.  
> 
> Makes sense. That is indeed cleaner. Will respin.
> 
> What I don't like though is 'bloat' argument.
> 'u32 num_args' padded to 8-byte takes exactly the same amount
> of space in 'struct tracepoint' and in 'struct bpf_raw_event_map'
> The number of these structures is the same as well
> and chances that tracepoints are on while bpf is off are slim.
> More so few emails ago you said:
> "I'm perfectly fine with adding the "num_args" stuff. I think it's
> really useful. It's only the for_each_kernel_tracepoint change for
> which I'm trying to understand the rationale."

I don't really care which one it goes in. The padding bloat is the same
for both :-/  But I wonder if we can shrink it by doing a trick that
Josh did in one of his patches: use a 32-bit offset instead of a direct
pointer, since you are only accessing core kernel tracepoints.

Thus, we could have

struct bpf_raw_event_map {
u32 tp_offset;
u32 num_args;
void*bpf_func;
};

and have:

u64 tp_offset = (u64)tp - (u64)_sdata;

if (WARN_ON(tp_offset > UINT_MAX))
return -EINVAL;

 btp->tp_offset = (u32)tp_offset;

And to get the tp, all you need to do is:

tp = (struct tracepoint *)(btp->tp_offset + (unsigned long)_sdata);

I've been thinking of doing this for other parts of the tracepoints and
ftrace code.

BTW, thanks for changing your code. I really appreciate it.

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-28 Thread Steven Rostedt
On Tue, 27 Mar 2018 17:51:55 -0700
Alexei Starovoitov  wrote:

> Turned out it was in init.data section and got poisoned.
> this fixes it:
> @@ -258,6 +258,7 @@
>  LIKELY_PROFILE()\
>  BRANCH_PROFILE()\
>  TRACE_PRINTKS() \
> +   BPF_RAW_TP()\
>  TRACEPOINT_STR()
> 
>   /*
> @@ -585,7 +586,6 @@
>  *(.init.rodata) \
>  FTRACE_EVENTS() \
>  TRACE_SYSCALLS()\
> -   BPF_RAW_TP()\
>  KPROBE_BLACKLIST()  \
>  ERROR_INJECT_WHITELIST()\
>  MEM_DISCARD(init.rodata)\
> 
> and it works :)
> I will clean up a few other nits I found while debugging and respin.

Getting it properly working was an exercise left to the reader ;-)

Sorry about that, I did a bit of copy and paste to get it working, and
copied from code that did things a bit differently, so I massaged it by
hand, and doing it quickly as I had other things to work on.

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 14:58:24 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> +extern struct bpf_raw_event_map *__start__bpf_raw_tp[];
> +extern struct bpf_raw_event_map *__stop__bpf_raw_tp[];
> +
> +struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
> +{
> + struct bpf_raw_event_map* const *btp = __start__bpf_raw_tp;
> +
> + for (; btp < __stop__bpf_raw_tp; btp++)
> + if (!strcmp((*btp)->tp->name, name))
> + return *btp;
> + return NULL;
> +}
> +

OK, this part is broken, and for some reason it didn't include my
changes to bpf_probe.h. I also tested this without setting BPF_EVENTS,
so I wasn't actually testing it.

I added a test in event_trace_init() to make sure that it worked:
(Not included in the patch below)

{
struct bpf_raw_event_map *btp;
btp = bpf_find_raw_tracepoint("sched_switch");
if (btp)
printk("found BPF_RAW_TRACEPOINT: %s %pS\n",
   btp->tp->name, btp->bpf_func);
else
printk("COULD NOT FIND BPF_RAW_TRACEPOINT\n");
}

And it found the tracepoint.

Here's take two

You can add my: Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>

-- Steve

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6fc..4fab7392e237 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -178,6 +178,15 @@
 #define TRACE_SYSCALLS()
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8); \
+VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;   \
+KEEP(*(__bpf_raw_tp_map))  \
+VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
+#else
+#define BPF_RAW_TP()
+#endif
+
 #ifdef CONFIG_SERIAL_EARLYCON
 #define EARLYCON_TABLE() STRUCT_ALIGN();   \
 VMLINUX_SYMBOL(__earlycon_table) = .;  \
@@ -576,6 +585,7 @@
*(.init.rodata) \
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
+   BPF_RAW_TP()\
KPROBE_BLACKLIST()  \
ERROR_INJECT_WHITELIST()\
MEM_DISCARD(init.rodata)\
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 399ebe6f90cf..fb4778c0a248 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -470,8 +470,9 @@ unsigned int trace_call_bpf(struct trace_event_call *call, 
void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog 
*prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void 
*ctx)
 {
@@ -491,14 +492,18 @@ perf_event_query_prog_array(struct perf_event *event, 
void __user *info)
 {
return -EOPNOTSUPP;
 }
-static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct 
bpf_prog *p)
 {
return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog 
*p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct 
bpf_prog *p)
 {
return -EOPNOTSUPP;
 }
+static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char 
*name)
+{
+   return NULL;
+}
 #endif
 
 enum {
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 39a283c61c51..35db8dd48c4c 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -36,4 +36,9 @@ struct tracepoint {
u32 num_args;
 };
 
+struct bpf_raw_event_map {
+   struct tracepoint   *tp;
+   void*bpf_func;
+};
+
 #endif
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index d2cc0663e618..bb8ed2f530ad 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -76,7 +76,13 @@ __bpf_trace_##call(void *__

Re: [PATCH v2 bpf-next] bpf, tracing: unbreak lttng

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 11:39:19 +0200
Daniel Borkmann <dan...@iogearbox.net> wrote:

> On 03/27/2018 02:07 AM, Steven Rostedt wrote:
> > On Mon, 26 Mar 2018 16:02:20 -0700
> > Alexei Starovoitov <a...@kernel.org> wrote:
> >   
> >> for_each_kernel_tracepoint() is used by out-of-tree lttng module
> >> and therefore cannot be changed.  
> > 
> > This is false and misleading. NACK.  
> 
> Steven, while I really don't care about this particular function, you wrote
> a few emails ago, quote:
> 
>   Look, the tracepoint code was written by Mathieu for LTTng, and perf
>   and ftrace were able to benefit because of it, as well as your bpf
>   code. For this, we agreed to keep this function around for his use,
>   as its the only thing he requires. Everyone has been fine with that.
>   [...] Having that function for LTTng does not hurt us. And I will NACK
>   removing it.
> 
> So later saying that "this is false and misleading" is kind of misleading
> by itself. ;-) Anyway, it would probably make sense to add a comment to
> for_each_kernel_tracepoint() that this is used by LTTng so that people
> don't accidentally remove it due to missing in-tree user. I would think
> some sort of clarification/background in a comment or such would have
> avoided the whole confusion and resulting discussion around this in the
> first place.

I agree, a comment should be added. But the function can still be
modified, which is what I meant here by being misleading.

> 
> Btw, in networking land, as soon as there is no in-tree user for a particular
> kernel function, it will get ripped out, no matter what. Given this is also
> the typical convention in the kernel, it may have caused some confusion with
> above preference.

This is usually the case with me too. This came from Mathieu doing a
lot of work to help perf and ftrace, while keeping this function for him
to maintain LTTng. It was the solution to a long drawn-out flame war. Yes,
there's a lot of history behind that function, and I agree that we
should comment the history behind it.

> 
> Anyway, given v6 is out now, I've tossed the old series from bpf-next tree.
> So I hope we can all move on with some more constructive discussion. :-)

Yes, which we are doing around the kallsyms part. ;-)

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt

[ Added Andrew Morton too ]

On Tue, 27 Mar 2018 15:00:41 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> On Tue, 27 Mar 2018 11:45:34 -0700
> Alexei Starovoitov <a...@fb.com> wrote:
> 
> > >> +
> > >> +snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> > >> +addr = kallsyms_lookup_name(buf);
> > >> +if (!addr)
> > >> +return -ENOENT;
> > >> +
> > >> +return tracepoint_probe_register(tp, (void *)addr, prog);
> > >
> > > You are putting in a hell of a lot of trust with kallsyms returning
> > > properly. I can see this being very fragile. This is calling a function
> > > based on the result of kallsyms. I'm sure the security folks would love
> > > this.
> > >
> > > There's a few things to make this a bit more robust. One is to add a
> > > table that points to all __bpf_trace_* functions, and verify that the
> > > result from kallsyms is in that table.
> > >
> > > Honestly, I think this is too much of a short cut and a hack. I know
> > > you want to keep it "simple" and save space, but you really should do
> > > it the same way ftrace and perf do it. That is, create a section and
> > > have all tracepoints create a structure that holds a pointer to the
> > > tracepoint and to the bpf probe function. Then you don't even need the
> > > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > > you get the tracepoint and the bpf function associated to it.
> > >
> > > Relying on kallsyms to return an address to execute is just way too
> > > extreme and fragile for my liking.
> > 
> > Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> > in kallsyms doesn't sound like good trade off to me.
> > If kallsyms are inaccurate all sorts of things will break:
> > kprobes, livepatch, etc.
> > I'd rather suggest for ftrace to use kallsyms approach as well
> > and reduce memory footprint.  
> 
> If Linus, Thomas, Peter, Ingo, and the security folks trust kallsyms to
> return a valid function pointer from a name, then sure, we can try
> going that way.

I would like an ack from Linus and/or Andrew before we go further down
this road.

-- Steve



Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 15:00:41 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> >  Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> > in kallsyms doesn't sound like good trade off to me.
> > If kallsyms are inaccurate all sorts of things will break:
> > kprobes, livepatch, etc.

And if kallsyms breaks, these will break by failing to attach, or some
other benign error. Ftrace uses kallsyms to find functions too, but it
only enables functions based on the result; it doesn't use the result
for anything but to compare it to what it already knows.

This is a first that I know of to trust that kallsyms returns something
that you expect to execute with no other validation. It may be valid,
but it also makes me very nervous too. If others are fine with such an
approach, then OK, we can enter a new chapter of development where we
can use kallsyms to find the functions we want to call and use it.
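
A sketch of what validating the kallsyms result could look like, with
hypothetical section and symbol names, just to show the shape of the
check:

	extern void *__start__bpf_trace_funcs[];
	extern void *__stop__bpf_trace_funcs[];

	/* only call a kallsyms result that is in our known-good table */
	static bool bpf_trace_func_valid(unsigned long addr)
	{
		void **p;

		for (p = __start__bpf_trace_funcs; p < __stop__bpf_trace_funcs; p++) {
			if ((unsigned long)*p == addr)
				return true;
		}
		return false;
	}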

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 11:45:34 -0700
Alexei Starovoitov  wrote:

> >> +
> >> +  snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
> >> +  addr = kallsyms_lookup_name(buf);
> >> +  if (!addr)
> >> +  return -ENOENT;
> >> +
> >> +  return tracepoint_probe_register(tp, (void *)addr, prog);  
> >
> > You are putting in a hell of a lot of trust with kallsyms returning
> > properly. I can see this being very fragile. This is calling a function
> > based on the result of kallsyms. I'm sure the security folks would love
> > this.
> >
> > There's a few things to make this a bit more robust. One is to add a
> > table that points to all __bpf_trace_* functions, and verify that the
> > result from kallsyms is in that table.
> >
> > Honestly, I think this is too much of a short cut and a hack. I know
> > you want to keep it "simple" and save space, but you really should do
> > it the same way ftrace and perf do it. That is, create a section and
> > have all tracepoints create a structure that holds a pointer to the
> > tracepoint and to the bpf probe function. Then you don't even need the
> > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > you get the tracepoint and the bpf function associated to it.
> >
> > Relying on kallsyms to return an address to execute is just way too
> > extreme and fragile for my liking.  
> 
> Wasting extra 8bytes * number_of_tracepoints just for lack of trust
> in kallsyms doesn't sound like good trade off to me.
> If kallsyms are inaccurate all sorts of things will break:
> kprobes, livepatch, etc.
> I'd rather suggest for ftrace to use kallsyms approach as well
> and reduce memory footprint.

If Linus, Thomas, Peter, Ingo, and the security folks trust kallsyms to
return a valid function pointer from a name, then sure, we can try
going that way.

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 13:11:43 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> On Tue, 27 Mar 2018 13:02:11 -0400
> Steven Rostedt <rost...@goodmis.org> wrote:
> 
> > Honestly, I think this is too much of a short cut and a hack. I know
> > you want to keep it "simple" and save space, but you really should do
> > it the same way ftrace and perf do it. That is, create a section and
> > have all tracepoints create a structure that holds a pointer to the
> > tracepoint and to the bpf probe function. Then you don't even need the
> > kernel_tracepoint_find_by_name(), you just iterate over your table and
> > you get the tracepoint and the bpf function associated to it.  
> 
> Also, if you do it the perf/ftrace way, you get support for module
> tracepoints pretty much for free. Which would include tracepoints in
> networking code that is loaded by a module.

This doesn't include module code (but that wouldn't be too hard to set
up), but I compiled and booted this. I didn't test if it works (I
didn't have the way to test bpf here). But this patch applies on top of
this patch (patch 8). You can remove patch 7 and fold this into this
patch. And then you can also make the __bpf_trace_* function static.

This would be much more robust and less error prone.

-- Steve

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6fc..4fab7392e237 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -178,6 +178,15 @@
 #define TRACE_SYSCALLS()
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8); \
+VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;   \
+KEEP(*(__bpf_raw_tp_map))  \
+VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
+#else
+#define BPF_RAW_TP()
+#endif
+
 #ifdef CONFIG_SERIAL_EARLYCON
 #define EARLYCON_TABLE() STRUCT_ALIGN();   \
 VMLINUX_SYMBOL(__earlycon_table) = .;  \
@@ -576,6 +585,7 @@
*(.init.rodata) \
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
+   BPF_RAW_TP()\
KPROBE_BLACKLIST()  \
ERROR_INJECT_WHITELIST()\
MEM_DISCARD(init.rodata)\
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 399ebe6f90cf..fb4778c0a248 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -470,8 +470,9 @@ unsigned int trace_call_bpf(struct trace_event_call *call, 
void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog 
*prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
-int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void 
*ctx)
 {
@@ -491,14 +492,18 @@ perf_event_query_prog_array(struct perf_event *event, 
void __user *info)
 {
return -EOPNOTSUPP;
 }
-static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct 
bpf_prog *p)
 {
return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog 
*p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct 
bpf_prog *p)
 {
return -EOPNOTSUPP;
 }
+static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char 
*name)
+{
+   return NULL;
+}
 #endif
 
 enum {
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 39a283c61c51..35db8dd48c4c 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -36,4 +36,9 @@ struct tracepoint {
u32 num_args;
 };
 
+struct bpf_raw_event_map {
+   struct tracepoint   *tp;
+   void*bpf_func;
+};
+
 #endif
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f100c63ff19e..6037a2f0108a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1312,7 +1312,7 @@ static int bpf_obj_get(const union bpf_attr *attr)
 }
 
 struct bpf_raw_tracepoint {
-   struct tracepoint *tp;
+   struct bpf_raw_

Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 13:02:11 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> Honestly, I think this is too much of a short cut and a hack. I know
> you want to keep it "simple" and save space, but you really should do
> it the same way ftrace and perf do it. That is, create a section and
> have all tracepoints create a structure that holds a pointer to the
> tracepoint and to the bpf probe function. Then you don't even need the
> kernel_tracepoint_find_by_name(), you just iterate over your table and
> you get the tracepoint and the bpf function associated to it.

Also, if you do it the perf/ftrace way, you get support for module
tracepoints pretty much for free. Which would include tracepoints in
networking code that is loaded by a module.

-- Steve


Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Steven Rostedt
On Mon, 26 Mar 2018 19:47:03 -0700
Alexei Starovoitov  wrote:


> Ctrl-C of tracing daemon or cmdline tool that uses this feature
> will automatically detach bpf program, unload it and
> unregister tracepoint probe.
> 
> On the kernel side for_each_kernel_tracepoint() is used

You need to update the change log to state
kernel_tracepoint_find_by_name().

But looking at the code, I really think you should do it properly and
not rely on a hack that finds your function via kallsyms lookup and
then executing the address it returns.

> to find a tracepoint with "xdp_exception" name
> (that would be __tracepoint_xdp_exception record)
> 
> Then kallsyms_lookup_name() is used to find the addr
> of __bpf_trace_xdp_exception() probe function.
> 
> And finally tracepoint_probe_register() is used to connect probe
> with tracepoint.
> 
> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
> tracepoint mechanisms. perf_event_open() can be used in parallel
> on the same tracepoint.
> Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
> Each with its own bpf program. The kernel will execute
> all tracepoint probes and all attached bpf programs.
> 
> In the future bpf_raw_tracepoints can be extended with
> query/introspection logic.
> 
> Signed-off-by: Alexei Starovoitov 
> ---
>  include/linux/bpf_types.h|   1 +
>  include/linux/trace_events.h |  37 +
>  include/trace/bpf_probe.h|  87 
>  include/trace/define_trace.h |   1 +
>  include/uapi/linux/bpf.h |  11 +++
>  kernel/bpf/syscall.c |  78 ++
>  kernel/trace/bpf_trace.c | 188 
> +++
>  7 files changed, 403 insertions(+)
>  create mode 100644 include/trace/bpf_probe.h
> 
>


> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -468,6 +468,8 @@ unsigned int trace_call_bpf(struct trace_event_call 
> *call, void *ctx);
>  int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog 
> *prog);
>  void perf_event_detach_bpf_prog(struct perf_event *event);
>  int perf_event_query_prog_array(struct perf_event *event, void __user *info);
> +int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
> +int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
>  #else
>  static inline unsigned int trace_call_bpf(struct trace_event_call *call, 
> void *ctx)
>  {
> @@ -487,6 +489,14 @@ perf_event_query_prog_array(struct perf_event *event, 
> void __user *info)
>  {
>   return -EOPNOTSUPP;
>  }
> +static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog 
> *p)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline int bpf_probe_unregister(struct tracepoint *tp, struct 
> bpf_prog *p)
> +{
> + return -EOPNOTSUPP;
> +}
>  #endif
>  
>  enum {
> @@ -546,6 +556,33 @@ extern void ftrace_profile_free_filter(struct perf_event 
> *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4);
> +void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5);
> +void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6);
> +void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
> +void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> + u64 arg8);
> +void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
> + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> + u64 arg8, u64 arg9);
> +void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10);
> +void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11);
> +void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +  u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +  u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
>  void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
>  struct trace_event_call *call, u64 count,
>  struct pt_regs *regs, struct hlist_head *head,
> diff --git 

Re: [PATCH v6 bpf-next 06/11] tracepoint: compute num_args at build time

2018-03-27 Thread Steven Rostedt
On Mon, 26 Mar 2018 19:47:01 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> From: Alexei Starovoitov <a...@kernel.org>
> 
> compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.
> The number is necessary to check safety of bpf program access that
> is coming in subsequent patch.
> 
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>

Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>

-- Steve

> ---
>  include/linux/tracepoint-defs.h |  1 +
>  include/linux/tracepoint.h  | 12 ++--
>  include/trace/define_trace.h| 14 +++---
>  3 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index 64ed7064f1fa..39a283c61c51 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -33,6 +33,7 @@ struct tracepoint {
>   int (*regfunc)(void);
>   void (*unregfunc)(void);
>   struct tracepoint_func __rcu *funcs;
> + u32 num_args;
>  };
>  
>  #endif
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..c92f4adbc0d7 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -230,18 +230,18 @@ extern void syscall_unregfunc(void);
>   * structures, so we create an array of pointers that will be used for 
> iteration
>   * on the tracepoints.
>   */
> -#define DEFINE_TRACE_FN(name, reg, unreg) \
> +#define DEFINE_TRACE_FN(name, reg, unreg, num_args)   \
>   static const char __tpstrtab_##name[]\
>   __attribute__((section("__tracepoints_strings"))) = #name;   \
>   struct tracepoint __tracepoint_##name\
>   __attribute__((section("__tracepoints"))) =  \
> - { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
> + { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, 
> num_args };\
>   static struct tracepoint * const __tracepoint_ptr_##name __used  \
>   __attribute__((section("__tracepoints_ptrs"))) = \
>   &__tracepoint_##name;
>  
> -#define DEFINE_TRACE(name)   \
> - DEFINE_TRACE_FN(name, NULL, NULL);
> +#define DEFINE_TRACE(name, num_args) \
> + DEFINE_TRACE_FN(name, NULL, NULL, num_args);
>  
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)   \
>   EXPORT_SYMBOL_GPL(__tracepoint_##name)
> @@ -275,8 +275,8 @@ extern void syscall_unregfunc(void);
>   return false;   \
>   }
>  
> -#define DEFINE_TRACE_FN(name, reg, unreg)
> -#define DEFINE_TRACE(name)
> +#define DEFINE_TRACE_FN(name, reg, unreg, num_args)
> +#define DEFINE_TRACE(name, num_args)
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
>  #define EXPORT_TRACEPOINT_SYMBOL(name)
>  
> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> index d9e3d4aa3f6e..96b22ace9ae7 100644
> --- a/include/trace/define_trace.h
> +++ b/include/trace/define_trace.h
> @@ -25,7 +25,7 @@
>  
>  #undef TRACE_EVENT
>  #define TRACE_EVENT(name, proto, args, tstruct, assign, print)   \
> - DEFINE_TRACE(name)
> + DEFINE_TRACE(name, COUNT_ARGS(args))
>  
>  #undef TRACE_EVENT_CONDITION
>  #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, 
> print) \
> @@ -39,24 +39,24 @@
>  #undef TRACE_EVENT_FN
>  #define TRACE_EVENT_FN(name, proto, args, tstruct,   \
>   assign, print, reg, unreg)  \
> - DEFINE_TRACE_FN(name, reg, unreg)
> + DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
>  
>  #undef TRACE_EVENT_FN_COND
>  #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,
> \
>   assign, print, reg, unreg)  \
> - DEFINE_TRACE_FN(name, reg, unreg)
> + DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
>  
>  #undef DEFINE_EVENT
>  #define DEFINE_EVENT(template, name, proto, args) \
> - DEFINE_TRACE(name)
> + DEFINE_TRACE(name, COUNT_ARGS(args))
>  
>  #undef DEFINE_EVENT_FN
>  #define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
> - DEFINE_TRACE_FN(name, reg, unreg)
> + DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
>  
>  #undef DEFINE_EVENT_PRINT
>  #define DEFINE_EVENT_PRINT(template, name, proto, args, print)   \
> - DEFINE_TRACE(name)
> + DEFINE_TRACE(name, COUNT_ARGS(args))
>  
>  #undef DEFINE_EVENT_CONDITION
>  #define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
> @@ -64,7 +64,7 @@
>  
>  #undef DECLARE_TRACE
>  #define DECLARE_TRACE(name, proto, args) \
> - DEFINE_TRACE(name)
> + DEFINE_TRACE(name, COUNT_ARGS(args))
>  
>  #undef TRACE_INCLUDE
>  #undef __TRACE_INCLUDE



Re: [PATCH v6 bpf-next 07/11] tracepoint: introduce kernel_tracepoint_find_by_name

2018-03-27 Thread Steven Rostedt
On Tue, 27 Mar 2018 10:18:24 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:

> - On Mar 27, 2018, at 10:07 AM, rostedt rost...@goodmis.org wrote:
> 
> > On Mon, 26 Mar 2018 19:47:02 -0700
> > Alexei Starovoitov <a...@fb.com> wrote:
> >   
> >> From: Alexei Starovoitov <a...@kernel.org>
> >> 
> >> introduce kernel_tracepoint_find_by_name() helper to let bpf core
> >> find tracepoint by name and later attach bpf probe to a tracepoint
> >> 
> >> Signed-off-by: Alexei Starovoitov <a...@kernel.org>  
> > 
> > Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>  
> 
> Steven showed preference for tracepoint_kernel_find_by_name() at some
> point (starting with a tracepoint_ prefix). I'm fine with either of
> the names.

Yeah, I do prefer tracepoint_kernel_find_by_name() to stay consistent
with the other tracepoint functions. But we have
"for_each_kernel_tracepoint()" and not "for_each_tracepoint_kernel()",
thus we need to pick being consistent with one or the other. One answer
is to use tracepoint_kernel_find_by_name() and rename the for_each to
for_each_tracepoint_kernel().

-- Steve


> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > Thanks for doing this Alexei!
> > 
> > One nit below.


Re: [PATCH v6 bpf-next 07/11] tracepoint: introduce kernel_tracepoint_find_by_name

2018-03-27 Thread Steven Rostedt
On Mon, 26 Mar 2018 19:47:02 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> From: Alexei Starovoitov <a...@kernel.org>
> 
> introduce kernel_tracepoint_find_by_name() helper to let bpf core
> find tracepoint by name and later attach bpf probe to a tracepoint
> 
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>

Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>

Thanks for doing this Alexei!

One nit below.


> ---
>  include/linux/tracepoint.h | 6 ++
>  kernel/tracepoint.c| 9 +
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c92f4adbc0d7..a00b84473211 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -43,6 +43,12 @@ tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
>  extern void
>  for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
>   void *priv);
> +#ifdef CONFIG_TRACEPOINTS
> +struct tracepoint *kernel_tracepoint_find_by_name(const char *name);
> +#else
> +static inline struct tracepoint *
> +kernel_tracepoint_find_by_name(const char *name) { return NULL; }
> +#endif
>  
>  #ifdef CONFIG_MODULES
>  struct tp_module {
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 671b13457387..e2a9a0391ae2 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -528,6 +528,15 @@ void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
>  }
>  EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint);
>  
> +struct tracepoint *kernel_tracepoint_find_by_name(const char *name)
> +{
> + struct tracepoint * const *tp = __start___tracepoints_ptrs;
> +
> + for (; tp < __stop___tracepoints_ptrs; tp++)
> + if (!strcmp((*tp)->name, name))
> + return *tp;


Usually for cases like this, we prefer to add brackets for the for
block, as it's not a single line below it.

for (; tp < __stop___tracepoints_ptrs; tp++) {
if (!strcmp((*tp)->name, name))
return *tp;
}

-- Steve




> + return NULL;
> +}
>  #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
>  
>  /* NB: reg/unreg are called while guarded with the tracepoints_mutex */



Re: [PATCH bpf-next] bpf, tracing: unbreak lttng

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 15:35:56 -0700
Alexei Starovoitov  wrote:


> > This patch is not reverting to the old code properly. It introduces a
> > static inline void function that returns NULL. Please compile-test
> > with CONFIG_TRACEPOINTS=n before submitting a patch involving tracepoints.  
> 
> right. good catch. v2 is coming.

Either fold the patch into the original patch, or I'm pulling in
Mathieu's patch and pushing it to Linus come the merge window.

-- Steve



Re: [PATCH v2 bpf-next] bpf, tracing: unbreak lttng

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 16:02:20 -0700
Alexei Starovoitov  wrote:

> for_each_kernel_tracepoint() is used by out-of-tree lttng module
> and therefore cannot be changed.

This is false and misleading. NACK.

-- Steve

> Instead introduce kernel_tracepoint_find_by_name() to find
> tracepoint by name.
> 
> Fixes: 9e9afbae6514 ("tracepoint: compute num_args at build time")
> Signed-off-by: Alexei Starovoitov 
> ---
> v1->v2: fix 'undef CONFIG_TRACEPOINTS' build as spotted by Mathieu
>


Re: [PATCH bpf-next] bpf, tracing: unbreak lttng

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 15:25:32 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/26/18 3:15 PM, Steven Rostedt wrote:
> > On Mon, 26 Mar 2018 15:08:45 -0700
> > Alexei Starovoitov <a...@kernel.org> wrote:
> >  
> >> for_each_kernel_tracepoint() is used by out-of-tree lttng module
> >> and therefore cannot be changed.
> >> Instead introduce kernel_tracepoint_find_by_name() to find
> >> tracepoint by name.
> >>
> >> Fixes: 9e9afbae6514 ("tracepoint: compute num_args at build time")
> >> Signed-off-by: Alexei Starovoitov <a...@kernel.org>  
> >
> > I'm curious, why can't you rebase? The first patch was never acked.  
> 
> because I think it makes sense to keep such things in the commit log
> and in the separate diff, so next developer is aware of what kind of
minefield the tracepoints are.

This is a bunch of BS. It's not a minefield, and you can change that
function. Mathieu is perfectly fine in modifying his code to deal with
it. He has several times in the past. But I did not agree with the
approach you were taking, that is why I'm against it. You are playing
the straw man with this.

> No wonder some maintainers refuse to add them.

Good grief. No! The reason maintainers refuse to add them is that
userspace can depend on them, and if that happens, it becomes an ABI.
Stop with this nonsense.

-- Steve


Re: [PATCH bpf-next] bpf, tracing: unbreak lttng

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 15:08:45 -0700
Alexei Starovoitov  wrote:

> for_each_kernel_tracepoint() is used by out-of-tree lttng module
> and therefore cannot be changed.
> Instead introduce kernel_tracepoint_find_by_name() to find
> tracepoint by name.
> 
> Fixes: 9e9afbae6514 ("tracepoint: compute num_args at build time")
> Signed-off-by: Alexei Starovoitov 

I'm curious, why can't you rebase? The first patch was never acked.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 11:39:05 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/26/18 11:11 AM, Steven Rostedt wrote:
> > On Mon, 26 Mar 2018 10:55:51 -0700
> > Alexei Starovoitov <a...@fb.com> wrote:
> >  
> >> An email ago you were ok to s/return/return NULL/ in your out-of-tree
> >> module, but now flip flop to add new function approach just to
> >> reduce the work you need to do in lttng?
> >> We're not talking about changing __kmalloc signature here.
> >> My patch extends for_each_kernel_tracepoint() api similar to other
> >> for_each_*() iterators and improves possible uses of it.  
> >
> > Alexei, do you have another use case for using
> > for_each_kernel_tracepoint() other than the find_tp? If so, then I'm
> > sure Mathieu can handle the change.
> >
> > But I think it's cleaner to add a tracepoint_find_by_name() function.
> > If you come up with another use case for using the for_each* function
> > then we'll consider changing it then.  
> 
> another use case ?! Frankly such reasoning smells.

WTF is the big deal here?

> 
> I'm fine doing quick followup patch to add tracepoint_find_by_name()

And BTW, it would actually have to be called
tracepoint_core_find_by_name() as it will not deal with modules.
Modules would have to have much more work to deal with.

> and restore 'return void' behavior of for_each_kernel_tracepoint's

What? you can't rebase now? Just don't touch that function.

> callback, but I'm struggling to accept the precedent it will create
> that all exported functions of kernel/tracepoint.c are really
> lttng extensions and we cannot easily change them.

First, my argument about your use case has little to do with LTTng.
I have to maintain this code, and this is my preference. Just like I do
the silly

 /* Comment like this,
  * for multiple lines
  */

When I deal with the networking code. Because that's the preference for
the networking folks.

> I'd like to hear Linus take on this.

I doubt he cares about something this petty.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 10:55:51 -0700
Alexei Starovoitov  wrote:

> An email ago you were ok to s/return/return NULL/ in your out-of-tree
> module, but now flip flop to add new function approach just to
> reduce the work you need to do in lttng?
> We're not talking about changing __kmalloc signature here.
> My patch extends for_each_kernel_tracepoint() api similar to other
> for_each_*() iterators and improves possible uses of it.

Alexei, do you have another use case for using
for_each_kernel_tracepoint() other than the find_tp? If so, then I'm
sure Mathieu can handle the change.

But I think it's cleaner to add a tracepoint_find_by_name() function.
If you come up with another use case for using the for_each* function
then we'll consider changing it then.


> One thing is to be nice to out-of-tree and do not break them
> for no reason, but arguing that kernel shouldn't add a minor extension
> to for_each_kernel_tracepoint() api is really taking the whole thing
> to next level.

That's not the point. I disagree with the reason for the change, and
believe that it would be cleaner to add a find_by_name() function.
Which would make your patch set even cleaner. 

Instead of having in the bpf code:

static void *__find_tp(struct tracepoint *tp, void *priv)
{
char *name = priv;

if (!strcmp(tp->name, name))
return tp;
return NULL;
}

[..]

tp = for_each_kernel_tracepoint(__find_tp, tp_name);
if (!tp)
return -ENOENT;


You would simply have:

tp = tracepoint_find_by_name(tp_name);
if (!tp)
return -ENOENT;

That would make the code more obvious as to what it is doing. And this
does not impede your patch set at all.
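
A minimal sketch of such a helper, assuming the for_each_kernel_tracepoint()
signature from the patch under discussion (callback returning void *) and
reusing the __find_tp() callback quoted above:

struct tracepoint *tracepoint_find_by_name(const char *name)
{
	/* Walk the core kernel tracepoints; __find_tp() ends the walk
	 * by returning the tracepoint whose name matches. */
	return for_each_kernel_tracepoint(__find_tp, (void *)name);
}

This also keeps the iteration details inside kernel/tracepoint.c, so a
caller like kernel/bpf/syscall.c never needs to know the internals of
struct tracepoint.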

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 09:47:21 -0700
Alexei Starovoitov  wrote:

> I don't mind to _rename_ for_each_kernel_tracepoint() into
> tracepoint_find_by_name(), but keeping exported function
> just to be used by out-of-tree modules would be wrong message for
> the kernel community in general.
> With my patch the for_each_kernel_tracepoint() will be used by bpf side
> and out-of-tree can trivially hack their callbacks to keep working.
> imo that's a better approach then renaming it.

Look, the tracepoint code was written by Mathieu for LTTng, and perf
and ftrace were able to benefit because of it, as well as your bpf code.
For this, we agreed to keep this function around for his use, as its the
only thing he requires. Everyone has been fine with that. Not all out
of tree code is evil. In fact, some out of tree modules help the kernel
community. You ask why I care. Because PREEMPT_RT has been one of those
out of tree modules that has helped the kernel community a lot. Have
you noticed that there are "raw_spin_lock()" and "spin_lock()"? There's
no difference between the two in the kernel. Why have them? Because
they are used by PREEMPT_RT.

Having that function for LTTng does not hurt us. And I will NACK
removing it.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 09:25:07 -0700
Alexei Starovoitov  wrote:

> commit log of patch 6 states:
> 
> "for_each_tracepoint_range() api has no users inside the kernel.
> Make it more useful with the ability to stop the for_each() loop
> depending on the callback return value.
> In such form it's used in subsequent patch."
> 
> and in patch 7:
> 
> +static void *__find_tp(struct tracepoint *tp, void *priv)
> +{
> +   char *name = priv;
> +
> +   if (!strcmp(tp->name, name))
> +   return tp;
> +   return NULL;
> +}
> ...
> +   struct tracepoint *tp;
> ...
> +   tp = for_each_kernel_tracepoint(__find_tp, tp_name);
> +   if (!tp)
> +   return -ENOENT;
> 
> still not obvious?

Please just create a new function called tracepoint_find_by_name(), and
use that. I don't see any benefit in using a for_each* function for
such a simple routine. Not to mention, you then don't need to know the
internals of a tracepoint in kernel/bpf/syscall.c.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 11:56:15 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> On Fri, 23 Mar 2018 19:30:34 -0700
> Alexei Starovoitov <a...@fb.com> wrote:
> 
> > +static void *for_each_tracepoint_range(struct tracepoint * const *begin,
> > +  struct tracepoint * const *end,
> > +  void *(*fct)(struct tracepoint *tp, void *priv),
> > +  void *priv)
> >  {
> > struct tracepoint * const *iter;
> > +   void *ret;
> >  
> > if (!begin)
> > -   return;
> > -   for (iter = begin; iter < end; iter++)
> > -   fct(*iter, priv);
> > +   return NULL;
> > +   for (iter = begin; iter < end; iter++) {
> > +   ret = fct(*iter, priv);
> > +   if (ret)
> > +   return ret;  
> 
> So you just stopped the loop here. You have an inconsistent state. What
> about the functions that were called before. How do you undo them? Or
> what about the rest that haven't been touched. This function gives no
> feedback to the caller.
> 

OK, I see my confusion with this patch. I much rather have a new
function, and this isn't about being nice to out of tree modules. We
can keep this function as is (to be nice), but my biggest squabble
about this patch is that the function name is inconsistent with what it's
doing. When I have a function that says "for_each" I expect it to go
through each function and not exit out. This is because of what I said
above. When you have a "for_each" function that stops in the middle, you
have a state that you may need to deal with. How would another user of
that function clean up the mess if it expected to fail at some random
location?

I see in the next patch that you are using it simply to find a
tracepoint with the given name. I'd be much happier to add a new
function called:

   tracepoint_find_by_name(const char *name)

Because using a "for_each" to implement such a simple function seems
more of a hack.

-- Steve


Re: [PATCH v5 bpf-next 00/10] bpf, tracing: introduce bpf raw tracepoints

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 09:00:33 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/26/18 8:47 AM, Steven Rostedt wrote:
> > On Mon, 26 Mar 2018 17:32:02 +0200
> > Daniel Borkmann <dan...@iogearbox.net> wrote:


> >> On 03/26/2018 05:04 PM, Steven Rostedt wrote:  
> >>> On Mon, 26 Mar 2018 10:28:03 +0200
> >>> Daniel Borkmann <dan...@iogearbox.net> wrote:
> >>>  
> >>>>> tracepoint    base   kprobe+bpf  tracepoint+bpf  raw_tracepoint+bpf
> >>>>> task_rename   1.1M   769K        947K            1.0M
> >>>>> urandom_read  789K   697K        750K            755K
> >>>>
> >>>> Applied to bpf-next, thanks Alexei!  
> >>>
> >>> Please wait till you have the proper acks. Some of this affects
> >>> tracing.  
> >>
> >> Ok, I thought time up to v5 was long enough. Anyway, in case there are
> >> objections I can still toss out the series from bpf-next tree worst case
> >> should e.g. follow-up fixups not be appropriate.  
> >
> > Yeah, I've been traveling a bit which slowed down my review process
> > (trying to catch up).  
> 
> v1 of this set was posted Feb 28.

Yep, which is when I traveled to the West coast 2/26 - 3/1 (but due to snow
storms, I didn't get home till late 3/2). Then I went back 3/6 and came
home 3/8 (again due to another snow storm, it was 3/9). Then I went to
ELC from 3/11 to 3/15 (Luckily, the third snow storm hit 3/14, and
didn't affect my return trip).

> imo one month is not an acceptable delay for maintainer to review
> the patches. You really need to consider group maintainership as
> we do with Daniel for bpf tree.

Perhaps, (which I talked to Masami about, just need to go through
logistics). But the tracing code isn't high volume, and the three weeks
of traveling for me was a fluke (didn't look at my schedule when I
agreed to make that second one).

> 
> > My main concern is with patch 6, as there are
> > external users of those functions. Although, we generally don't cater
> > to out of tree code, we play nice with LTTng, and I don't want to break
> > it.  
> 
> out-of-tree module is out of tree. I'm beyond surprised that you
> propose to keep for_each_kernel_tracepoint() as-is with zero in-tree
> users in order to keep lttng working.

I'm nice.

> 
> > I also should probably pull in the patches and run them through my
> > tests to make sure they don't have any other side effects.  
> 
> so let me rephrase.
> You're saying that a change to a function with zero in-tree users
> can somehow break your tests?
> How is that possible?
> Does it mean you also have some out-of-tree modules that will break?
> and that _is_ the real reason for objection?

That function isn't what I'm worried about. You changed much more than
that.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Fri, 23 Mar 2018 19:30:34 -0700
Alexei Starovoitov  wrote:

> +static void *for_each_tracepoint_range(struct tracepoint * const *begin,
> +struct tracepoint * const *end,
> +void *(*fct)(struct tracepoint *tp, void *priv),
> +void *priv)
>  {
>   struct tracepoint * const *iter;
> + void *ret;
>  
>   if (!begin)
> - return;
> - for (iter = begin; iter < end; iter++)
> - fct(*iter, priv);
> + return NULL;
> + for (iter = begin; iter < end; iter++) {
> + ret = fct(*iter, priv);
> + if (ret)
> + return ret;

So you just stopped the loop here. You have an inconsistent state. What
about the functions that were called before. How do you undo them? Or
what about the rest that haven't been touched. This function gives no
feedback to the caller.

-- Steve


> + }
> + return NULL;
>  }


Re: [PATCH v5 bpf-next 00/10] bpf, tracing: introduce bpf raw tracepoints

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 17:32:02 +0200
Daniel Borkmann <dan...@iogearbox.net> wrote:

> On 03/26/2018 05:04 PM, Steven Rostedt wrote:
> > On Mon, 26 Mar 2018 10:28:03 +0200
> > Daniel Borkmann <dan...@iogearbox.net> wrote:
> >   
> >>> tracepoint    base   kprobe+bpf  tracepoint+bpf  raw_tracepoint+bpf
> >>> task_rename   1.1M   769K        947K            1.0M
> >>> urandom_read  789K   697K        750K            755K
> >>
> >> Applied to bpf-next, thanks Alexei!  
> > 
> > Please wait till you have the proper acks. Some of this affects
> > tracing.  
> 
> Ok, I thought time up to v5 was long enough. Anyway, in case there are
> objections I can still toss out the series from bpf-next tree worst case
> should e.g. follow-up fixups not be appropriate.

Yeah, I've been traveling a bit which slowed down my review process
(trying to catch up). My main concern is with patch 6, as there are
external users of those functions. Although, we generally don't cater
to out of tree code, we play nice with LTTng, and I don't want to break
it.

I also should probably pull in the patches and run them through my
tests to make sure they don't have any other side effects.

-- Steve


Re: [PATCH v5 bpf-next 00/10] bpf, tracing: introduce bpf raw tracepoints

2018-03-26 Thread Steven Rostedt
On Mon, 26 Mar 2018 10:28:03 +0200
Daniel Borkmann  wrote:

> > tracepoint    base   kprobe+bpf  tracepoint+bpf  raw_tracepoint+bpf
> > task_rename   1.1M   769K        947K            1.0M
> > urandom_read  789K   697K        750K            755K
> 
> Applied to bpf-next, thanks Alexei!

Please wait till you have the proper acks. Some of this affects
tracing.

-- Steve


Re: [PATCH v5 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-26 Thread Steven Rostedt
On Fri, 23 Mar 2018 19:30:34 -0700
Alexei Starovoitov  wrote:

> From: Alexei Starovoitov 
> 
> add fancy macro to compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.
> The number is necessary to check safety of bpf program access that
> is coming in subsequent patch.
> 
> for_each_tracepoint_range() api has no users inside the kernel.
> Make it more useful with the ability to stop the for_each() loop
> depending on the callback return value.
> In such form it's used in subsequent patch.

I believe this is used by LTTng.

-- Steve

> 
> Signed-off-by: Alexei Starovoitov 
> ---
>  include/linux/tracepoint-defs.h |  1 +
>  include/linux/tracepoint.h  | 28 +++-
>  include/trace/define_trace.h| 14 +++---
>  kernel/tracepoint.c | 27 ---
>  4 files changed, 43 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index 64ed7064f1fa..39a283c61c51 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -33,6 +33,7 @@ struct tracepoint {
>   int (*regfunc)(void);
>   void (*unregfunc)(void);
>   struct tracepoint_func __rcu *funcs;
> + u32 num_args;
>  };
>  
>  #endif
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..2194e7c31484 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -40,9 +40,19 @@ tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
>  int prio);
>  extern int
>  tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
> -extern void
> -for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
> - void *priv);
> +
> +#ifdef CONFIG_TRACEPOINTS
> +void *
> +for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
> +void *priv);
> +#else
> +static inline void *
> +for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
> +void *priv)
> +{
> + return NULL;
> +}
> +#endif
>  
>  #ifdef CONFIG_MODULES
>  struct tp_module {
> @@ -230,18 +240,18 @@ extern void syscall_unregfunc(void);
>   * structures, so we create an array of pointers that will be used for iteration
>   * on the tracepoints.
>   */
> -#define DEFINE_TRACE_FN(name, reg, unreg) \
> +#define DEFINE_TRACE_FN(name, reg, unreg, num_args)   \
>   static const char __tpstrtab_##name[]\
>   __attribute__((section("__tracepoints_strings"))) = #name;   \
>   struct tracepoint __tracepoint_##name\
>   __attribute__((section("__tracepoints"))) =  \
> - { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
> + { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, num_args };\
>   static struct tracepoint * const __tracepoint_ptr_##name __used  \
>   __attribute__((section("__tracepoints_ptrs"))) = \
>   &__tracepoint_##name;
>  
> -#define DEFINE_TRACE(name)   \
> - DEFINE_TRACE_FN(name, NULL, NULL);
> +#define DEFINE_TRACE(name, num_args) \
> + DEFINE_TRACE_FN(name, NULL, NULL, num_args);
>  
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)   \
>   EXPORT_SYMBOL_GPL(__tracepoint_##name)
> @@ -275,8 +285,8 @@ extern void syscall_unregfunc(void);
>   return false;   \
>   }
>  
> -#define DEFINE_TRACE_FN(name, reg, unreg)
> -#define DEFINE_TRACE(name)
> +#define DEFINE_TRACE_FN(name, reg, unreg, num_args)
> +#define DEFINE_TRACE(name, num_args)
>  #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
>  #define EXPORT_TRACEPOINT_SYMBOL(name)
>  
> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> index d9e3d4aa3f6e..96b22ace9ae7 100644
> --- a/include/trace/define_trace.h
> +++ b/include/trace/define_trace.h
> @@ -25,7 +25,7 @@
>  
>  #undef TRACE_EVENT
>  #define TRACE_EVENT(name, proto, args, tstruct, assign, print)   \
> - DEFINE_TRACE(name)
> + DEFINE_TRACE(name, COUNT_ARGS(args))
>  
>  #undef TRACE_EVENT_CONDITION
>  #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
> @@ -39,24 +39,24 @@
>  #undef TRACE_EVENT_FN
>  #define TRACE_EVENT_FN(name, proto, args, tstruct,   \
>   assign, print, reg, unreg)  \
> - DEFINE_TRACE_FN(name, reg, unreg)
> + DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
>  
>  #undef TRACE_EVENT_FN_COND
>  #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
> 

Re: [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-23 Thread Steven Rostedt
On Sat, 24 Mar 2018 00:13:28 +0100
Daniel Borkmann  wrote:

> #define UNPACK(...)   __VA_ARGS__
> #define REPEAT_1(FN, DL, X, ...)  FN(X)
> #define REPEAT_2(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__)
> #define REPEAT_3(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__)
> #define REPEAT_4(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__)
> #define REPEAT_5(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__)
> #define REPEAT_6(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__)
> #define REPEAT_7(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__)
> #define REPEAT_8(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__)
> #define REPEAT_9(FN, DL, X, ...)  FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__)
> #define REPEAT(X, FN, DL, ...)    REPEAT_##X(FN, DL, __VA_ARGS__)
> 
> #define SARG(X)   u64 arg##X
> #define COPY(X)   args[X] = arg##X
> 
> #define __DL_COM  (,)
> #define __DL_SEM  (;)
> 
> #define __SEQ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
> 
> #define BPF_TRACE_DECL_x(x)   \
>   void bpf_trace_run##x(struct bpf_prog *prog,\
> REPEAT(x, SARG, __DL_COM, __SEQ))
> #define BPF_TRACE_DEFN_x(x)   \
>   void bpf_trace_run##x(struct bpf_prog *prog,\
> REPEAT(x, SARG, __DL_COM, __SEQ)) \
>   {   \
>   u64 args[x];\
>   REPEAT(x, COPY, __DL_SEM, __SEQ);   \
>   __bpf_trace_run(prog, args);\
>   }   \
>   EXPORT_SYMBOL_GPL(bpf_trace_run##x)
> 
> So doing a ...
> 
> BPF_TRACE_DECL_x(5);
> BPF_TRACE_DEFN_x(5);
> 
> ... will generate in kernel/trace/bpf_trace.i:
> 
> void bpf_foo_trace_run5(struct bpf_prog *prog, u64 arg0, u64 arg1, u64 arg2, u64 arg3, u64 arg4);
> void bpf_foo_trace_run5(struct bpf_prog *prog, u64 arg0, u64 arg1, u64 arg2, u64 arg3, u64 arg4)
> {
>   u64 args[5];
>   args[0] = arg0 ;
>   args[1] = arg1 ;
>   args[2] = arg2 ;
>   args[3] = arg3 ;
>   args[4] = arg4;
>   __bpf_trace_run(prog, args);
> } [...]
> 
> Meaning, the EVALx() macros could be removed from there, too. Potentially, the
> REPEAT() macro could sit in its own include/linux/ header for others to reuse
> or such.

And people think my macro magic in include/trace/ftrace_event.h is
funky. Now I know who stole my MACRO MAGIC HAT.

-- Steve


Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments

2018-03-22 Thread Steven Rostedt
On Thu, 22 Mar 2018 12:31:12 -0700
Alexei Starovoitov <a...@fb.com> wrote:

> On 3/22/18 11:11 AM, Steven Rostedt wrote:
> > On Thu, 22 Mar 2018 11:01:48 -0700
> > Alexei Starovoitov <a...@fb.com> wrote:
> >  
> >> From: Alexei Starovoitov <a...@kernel.org>
> >>
> >> Fix all tracepoint arguments to pass structures (large and small) by 
> >> reference
> >> instead of by value.
> >> Avoiding passing large structs by value is a good coding style.
> >> Passing small structs sometimes is beneficial, but in all cases
> >> it makes no difference vs readability of the code.
> >> The subsequent patch enforces that all tracepoints args are either integers
> >> or pointers and fit into 64-bit.  
> >
> > But some of these structures are used to force type checking, and are
> > just the same size as a number. That's why they don't have "struct" in
> > front of them. Like pmd_t. Will the subsequent patches really break if
> > the structure itself has one element that is of size long? Just seems
> > to add extra code to pass in an address to something that fits into a
> > single register.  
> 
> yeah. C doesn't allow casting of 'struct s { u64 var };' into u64
> without massive hacks and aliasing warnings by compiler.
> CAST_TO_U64 macro in patch 7 will prevent tracepoint arguments to be
> things like pmd_t. It's not perfect, but doing & of pmd_t variable
> is imo clean enough as you can see in this patch.
> The macro can be tweaked to do the cast like
> *(sizeof(typeof(arg))*),
> but there is no way to get rid of compiler warning.

OK, but instead of changing it to pass by reference, why not just pass
the value. That is:

 static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
-   trace_xen_mmu_set_pte_atomic(ptep, pte);
+   trace_xen_mmu_set_pte_atomic(ptep, native_pte_val(pte));
set_64bit((u64 *)ptep, native_pte_val(pte));
 }

It shouldn't add any extra code, as those helper functions are
basically just special casts.

-- Steve


Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments

2018-03-22 Thread Steven Rostedt
On Thu, 22 Mar 2018 11:01:48 -0700
Alexei Starovoitov  wrote:

> From: Alexei Starovoitov 
> 
> Fix all tracepoint arguments to pass structures (large and small) by reference
> instead of by value.
> Avoiding passing large structs by value is a good coding style.
> Passing small structs sometimes is beneficial, but in all cases
> it makes no difference vs readability of the code.
> The subsequent patch enforces that all tracepoints args are either integers
> or pointers and fit into 64-bit.

But some of these structures are used to force type checking, and are
just the same size as a number. That's why they don't have "struct" in
front of them. Like pmd_t. Will the subsequent patches really break if
the structure itself has one element that is of size long? Just seems
to add extra code to pass in an address to something that fits into a
single register.

-- Steve



Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-22 Thread Steven Rostedt
On Thu, 22 Mar 2018 08:51:16 -0700
Alexei Starovoitov  wrote:

> So I definitely support the idea of build time warn for large
> number of args.

I'm more for a build time error for large number of args.

-- Steve


Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-22 Thread Steven Rostedt
On Wed, 21 Mar 2018 15:05:46 -0700
Alexei Starovoitov  wrote:

> Like the only reason my patch is counting till 17 is because of
> trace_iwlwifi_dev_ucode_error().
> The next offenders are using 12 arguments:
> trace_mc_event()
> trace_mm_vmscan_lru_shrink_inactive()
> 
> Clearly not every efficient usage of it:
>  trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>  nr_scanned, nr_reclaimed,
>  stat.nr_dirty,  stat.nr_writeback,
>  stat.nr_congested, stat.nr_immediate,
>  stat.nr_activate, stat.nr_ref_keep,
>  stat.nr_unmap_fail,
>  sc->priority, file);
> could have passed  instead.

Yes they should have, and if I was on the Cc for that patch, I would
have yelled at them and told them that's exactly what they needed to do.

Perhaps I should add something to keep any tracepoint from having more
than 6 arguments. That should force a clean up quickly.

I think I may start doing that.
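
A sketch of what such a guard could look like, assuming a file-scope
static assertion is available and the num_args plumbing discussed in
this series (the limit of 6 and the wrapper shape are illustrative):

#define MAX_TRACE_ARGS 6

#undef DEFINE_TRACE
#define DEFINE_TRACE(name, num_args)					\
	_Static_assert((num_args) <= MAX_TRACE_ARGS,			\
		       "tracepoint " #name ": too many arguments");	\
	DEFINE_TRACE_FN(name, NULL, NULL, num_args)

With something like that in place, offenders such as
trace_mm_vmscan_lru_shrink_inactive() would fail the build instead of
quietly growing past the limit.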

-- Steve


Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt
On Fri, 9 Mar 2018 22:15:23 -0500
Steven Rostedt <rost...@goodmis.org> wrote:

> Sorry for the spam.

A little more spam ;-)

I know what happened, as I'm able to recreate it. For those that use
claws-mail, be careful.

I clicked on the email I wanted to reply to. Then I must have hit the
down arrow key, as it moved the selected email to the next email. I hit
"Reply All", and it showed the email I clicked on (as well as the
subject), but the Cc list belonged to the email that was selected by the
down arrow key.

That's a nasty subtle bug. I think I'll go report this to the claws
folks.

-- Steve


Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt

I don't know what the hell happened, but claws mail just inserted a
ton of people into the Cc (I didn't add you). I noticed it just after I
hit send. The added Cc looks like it came from the email right after the
email I was replying to "Subject: Re: [PATCH v3] kernel.h: Skip
single-eval logic on literals in min()/max()".

Sorry for the spam.

-- Steve


On Fri, 9 Mar 2018 22:10:19 -0500
Steven Rostedt <rost...@goodmis.org> wrote:

> On Fri, 09 Mar 2018 21:34:45 -0500
> Steven Rostedt <rost...@goodmis.org> wrote:
> 
> >  2 files changed, 1050 insertions(+), 1273 deletions(-)  
> 
> BTW, it's a bit bigger impact than 223 deletions. As I added a lot of
> comments to explain the logic better. Removing comments and white space
> from the modifications we have:
> 
>  649 insertions(+), 1035 deletions(-)
> 
> That's 386 lines of code removed.
> 
> -- Steve



Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt
On Fri, 09 Mar 2018 21:34:45 -0500
Steven Rostedt <rost...@goodmis.org> wrote:

>  2 files changed, 1050 insertions(+), 1273 deletions(-)

BTW, it's a bit bigger impact than 223 deletions. As I added a lot of
comments to explain the logic better. Removing comments and white space
from the modifications we have:

 649 insertions(+), 1035 deletions(-)

That's 386 lines of code removed.

-- Steve


Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Steven Rostedt
On Thu, 8 Mar 2018 09:02:36 -0600
Josh Poimboeuf  wrote:

> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote:
> > This series adds SIMPLE_MAX() to be used in places where a stack array
> > is actually fixed, but the compiler still warns about VLA usage due to
> > confusion caused by the safety checks in the max() macro.
> > 
> > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(),
> > and they should all have no operational differences.  
> 
> What if we instead simplify the max() macro's type checking so that GCC
> can more easily fold the array size constants?  The below patch seems to
> work:

Nice. Have you tried to do an allmodconfig and build on various archs?

Of course pushing it to kernel.org will have the zero day bot do it for
you ;-)

-- Steve



Re: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

2018-03-05 Thread Steven Rostedt
On Mon, 5 Mar 2018 14:36:07 +0100
Daniel Borkmann  wrote:

> Ping, Peter/Steven. If you have a chance, please review the series.

You're not off my radar, but I'm doing a lot of traveling for the next
two weeks (started last week). I'll see if I can find some time to look
at them. I scanned them over once, and they look interesting ;-)

-- Steve


Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Steven Rostedt
On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:

> I don't think that's the case. My reading of current
> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> is that it will not be true for old mcount case.

In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.

> 
> As far as the rest of your arguments it very much puzzles me that
> you claim that this patch is supposed to work based on historical
> reasoning whereas you did NOT test it.

I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?

-- Steve


Re: [PATCH net-next v4.1 5/6] net: dccp: Add DCCP sendmsg trace event

2017-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2017 10:57:36 -0500 (EST)
David Miller  wrote:

> When you fix any part of a patch series, you must always repost the
> entire series from scratch, not just the patch(s) that change.

He probably gets that from me. I don't usually require a full series if
only a single patch changes. But I don't receive as many patch series
as you do, so I can handle that work flow without too much difficulty.

-- Steve


Re: [PATCH net-next v4 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events

2017-12-20 Thread Steven Rostedt
On Thu, 21 Dec 2017 11:36:57 +0900
Masami Hiramatsu  wrote:

> On Wed, 20 Dec 2017 14:24:24 -0500 (EST)
> David Miller  wrote:
> 
> > From: David Miller 
> > Date: Wed, 20 Dec 2017 14:20:40 -0500 (EST)
> >   
> > > From: Masami Hiramatsu 
> > > Date: Wed, 20 Dec 2017 13:14:11 +0900
> > >   
> > >> This series is v4 of the replacement of jprobe usage with trace
> > >> events. This version is rebased on net-next, fixes a build warning
> > >> and moves a temporal variable definition in a block.
> > >> 
> > >> Previous version is here;
> > >> https://lkml.org/lkml/2017/12/19/153
> > >> 
> > >> Changes from v3:
> > >>   All: Rebased on net-next
> > >>   [3/6]: fixes a build warning for i386 by casting pointer unsigned
> > >> long instead of __u64, and moves a temporal variable
> > >>  definition in a block.  
> > > 
> > > Looks good, series applied to net-next, thanks.  
> > 
> > Actually, this doesn't even compile, so I've reverted:
> > 
> > [davem@dhcp-10-15-49-227 net-next]$ make -s -j16
> > In file included from net/dccp/trace.h:105:0,
> >  from net/dccp/proto.c:42:
> > ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
> >  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> >   ^
> > compilation terminated.  
> 
> Hi David,
> 
> Could you share your .config file? I would like to reproduce it.
> When I tried with attached kconfig, I could not reproduce this issue,
> and could run it.
>

Hi Masami,

Are you sure you committed everything in this change set? I don't see a
modification of the Makefile to get the trace.h file you created.
Shouldn't there be something like:

CFLAGS_proto.o := -I$(src)

in the Makefile?

-- Steve


Re: [PATCH -tip v3 3/6] net: sctp: Add SCTP ACK tracking trace event

2017-12-19 Thread Steven Rostedt
On Tue, 19 Dec 2017 17:58:25 +0900
Masami Hiramatsu  wrote:

> +TRACE_EVENT(sctp_probe,
> +
> + TP_PROTO(const struct sctp_endpoint *ep,
> +  const struct sctp_association *asoc,
> +  struct sctp_chunk *chunk),
> +
> + TP_ARGS(ep, asoc, chunk),
> +
> + TP_STRUCT__entry(
> + __field(__u64, asoc)
> + __field(__u32, mark)
> + __field(__u16, bind_port)
> + __field(__u16, peer_port)
> + __field(__u32, pathmtu)
> + __field(__u32, rwnd)
> + __field(__u16, unack_data)
> + ),
> +
> + TP_fast_assign(
> + struct sctp_transport *sp;
> + struct sk_buff *skb = chunk->skb;
> +
> + __entry->asoc = (__u64)asoc;
> + __entry->mark = skb->mark;
> + __entry->bind_port = ep->base.bind_addr.port;
> + __entry->peer_port = asoc->peer.port;
> + __entry->pathmtu = asoc->pathmtu;
> + __entry->rwnd = asoc->peer.rwnd;
> + __entry->unack_data = asoc->unack_data;
> +
> + if (trace_sctp_probe_path_enabled()) {
> + list_for_each_entry(sp, &asoc->peer.transport_addr_list,
> + transports) {
> + trace_sctp_probe_path(sp, asoc);
> + }
> + }

I thought you were going to move this into the code, like I suggested?

-- Steve

> + ),
> +
> + TP_printk("asoc=%#llx mark=%#x bind_port=%d peer_port=%d pathmtu=%d "
> +   "rwnd=%u unack_data=%d",
> +   __entry->asoc, __entry->mark, __entry->bind_port,
> +   __entry->peer_port, __entry->pathmtu, __entry->rwnd,
> +   __entry->unack_data)
> +);
> +


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 14:00:11 +1100
"Tobin C. Harding"  wrote:

> I ran through these as outlined here for the new version (v4). This hits
> the modified code but doesn't test symbol look up failure.

stacktrace shouldn't post non-kernel values, unless there's a frame
pointer that isn't handled by kallsyms.

As for the other two, we could probably force a failure, like:

 # echo 'hist:keys=hrtimer.sym' > \
 events/timer/hrtimer_start/trigger
 # cat events/timer/hrtimer_start/hist

And then just add sym-offset too.

> 
> I also configured kernel with 'Perform a startup test on ftrace' for
> good luck.
> 
> Are you happy with this level of testing?

Can you try the above.

-- Steve


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 08:16:14 +1100
"Tobin C. Harding"  wrote:

> > >  #endif /* _LINUX_KERNEL_TRACE_H */
> > > diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> > > index 1e1558c99d56..3e28522a76f4 100644
> > > --- a/kernel/trace/trace_events_hist.c
> > > +++ b/kernel/trace/trace_events_hist.c
> > > @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct seq_file *m,
> > >   return;
> > >  
> > >   seq_printf(m, "%*c", 1 + spaces, ' ');
> > > - sprint_symbol(str, stacktrace_entries[i]);
> > > + trace_sprint_symbol_addr(str, stacktrace_entries[i]);  
> > 

> 
> If you have the time to give me some brief pointers on how I should go
> about testing this I'd love to test it before the next version. I know
> very little about ftrace.

For hitting the histogram stacktrace trigger (this code path), make
sure you have CONFIG_HIST_TRIGGERS enabled. And then do:

 # cd /sys/kernel/debug/tracing
 # echo 'hist:keys=common_pid.execname,stacktrace:vals=prev_state' > \
 events/sched/sched_switch/trigger
 # cat events/sched/sched_switch/hist

For the "sym" part, you can do (from the same directory):

 # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
 events/kmem/kmalloc/trigger
 # cat events/kmem/kmalloc/hist


And for sym-offset:

 # echo 'hist:keys=call_site.sym-offset:vals=bytes_req' > \
events/kmem/kmalloc/trigger
 # cat events/kmem/kmalloc/hist

-- Steve



Re: [PATCH 1/3] kallsyms: don't leak address when symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 09:41:29 +1100
"Tobin C. Harding"  wrote:

> Current suggestion on list is to remove this function. Do you have a use
> case in mind where debugging will break? We could add a fix to this
> series if so. Otherwise next version will likely drop
> string_is_no_symbol()

What about adding a kernel command line parameter that lets one put
back the old behavior.

"insecure_print_all_symbols" ?

-- Steve


Re: [v2 PATCH -tip 3/6] net: sctp: Add SCTP ACK tracking trace event

2017-12-18 Thread Steven Rostedt
On Mon, 18 Dec 2017 17:12:15 +0900
Masami Hiramatsu  wrote:

> Add SCTP ACK tracking trace event to trace the changes of SCTP
> association state in response to incoming packets.
> It is used for debugging SCTP congestion control algorithms,
> and will replace sctp_probe module.
> 
> Note that this event is a bit tricky. Since this consists of 2
> events (sctp_probe and sctp_probe_path) so you have to enable
> both events as below.
> 
>   # cd /sys/kernel/debug/tracing
>   # echo 1 > events/sctp/sctp_probe/enable
>   # echo 1 > events/sctp/sctp_probe_path/enable
> 
> Or, you can enable all the events under sctp.
> 
>   # echo 1 > events/sctp/enable
> 
> Since sctp_probe_path event is always invoked from sctp_probe
> event, you can not see any output if you only enable
> sctp_probe_path.

I have to ask, why did you do it this way?


> +#include 
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index 8f8ccded13e4..c5f92b2cc5c3 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -59,6 +59,9 @@
>  #include 
>  #include 
>  
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  static struct sctp_packet *sctp_abort_pkt_new(
>   struct net *net,
>   const struct sctp_endpoint *ep,
> @@ -3219,6 +3222,8 @@ enum sctp_disposition sctp_sf_eat_sack_6_2(struct net *net,
>   struct sctp_sackhdr *sackh;
>   __u32 ctsn;
>  
> + trace_sctp_probe(ep, asoc, chunk);

What about doing this right after this probe:

if (trace_sctp_probe_path_enabled()) {
struct sctp_transport *sp;

list_for_each_entry(sp, &asoc->peer.transport_addr_list,
transports) {
trace_sctp_probe_path(sp, asoc);
}
}

The "trace_sctp_probe_path_enabled()" is a static branch, which means
it's a nop just like a tracepoint is, and will not add any overhead if
the trace_sctp_probe_path is not enabled.

-- Steve

> +
>   if (!sctp_vtag_verify(chunk, asoc))
>   return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>  



Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Steven Rostedt
On Mon, 18 Dec 2017 10:53:32 +1100
"Tobin C. Harding"  wrote:

> Fixes behaviour modified by: commit bd6b239cdbb2 ("kallsyms: don't leak
> address when symbol not found")
> 
> Previous patch changed behaviour of kallsyms function sprint_symbol() to
> return an error code instead of printing the address if a symbol was not
> found. Ftrace relies on the original behaviour. We should not break
> tracing when applying the previous patch. We can maintain the original
> behaviour by checking the return code on calls to sprint_symbol() and
> friends.
> 
> Check return code and print actual address on error (i.e symbol not
> found).
> 
> Signed-off-by: Tobin C. Harding 
> ---
>  kernel/trace/trace.h | 24 
>  kernel/trace/trace_events_hist.c |  6 +++---
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 2a6d0325a761..881b1a577d75 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1814,4 +1814,28 @@ static inline void trace_event_eval_update(struct 
> trace_eval_map **map, int len)
>  
>  extern struct trace_iterator *tracepoint_print_iter;
>  
> +static inline int
> +trace_sprint_symbol(char *buffer, unsigned long address)
> +{
> + int ret;
> +
> + ret = sprint_symbol(buffer, address);
> + if (ret == -1)
> + ret = sprintf(buffer, "0x%lx", address);
> +
> + return ret;
> +}
> +
> +static inline int
> +trace_sprint_symbol_no_offset(char *buffer, unsigned long address)
> +{
> + int ret;
> +
> + ret = sprint_symbol_no_offset(buffer, address);
> + if (ret == -1)
> + ret = sprintf(buffer, "0x%lx", address);
> +
> + return ret;
> +}
> +
>  #endif /* _LINUX_KERNEL_TRACE_H */
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 1e1558c99d56..3e28522a76f4 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct seq_file *m,
>   return;
>  
>   seq_printf(m, "%*c", 1 + spaces, ' ');
> - sprint_symbol(str, stacktrace_entries[i]);
> + trace_sprint_symbol_addr(str, stacktrace_entries[i]);

Hmm, where is trace_sprint_symbol_addr() defined?

-- Steve

>   seq_printf(m, "%s\n", str);
>   }
>  }
> @@ -1014,12 +1014,12 @@ hist_trigger_entry_print(struct seq_file *m,
>   seq_printf(m, "%s: %llx", field_name, uval);
>   } else if (key_field->flags & HIST_FIELD_FL_SYM) {
>   uval = *(u64 *)(key + key_field->offset);
> - sprint_symbol_no_offset(str, uval);
> + trace_sprint_symbol_no_offset(str, uval);
>   seq_printf(m, "%s: [%llx] %-45s", field_name,
>  uval, str);
>   } else if (key_field->flags & HIST_FIELD_FL_SYM_OFFSET) {
>   uval = *(u64 *)(key + key_field->offset);
> - sprint_symbol(str, uval);
> + trace_sprint_symbol(str, uval);
>   seq_printf(m, "%s: [%llx] %-55s", field_name,
>  uval, str);
>   } else if (key_field->flags & HIST_FIELD_FL_EXECNAME) {



Re: [PATCH] bpf: Fix compile warnings when !CONFIG_BPF_SYSCALL

2017-12-01 Thread Steven Rostedt
On Fri, 1 Dec 2017 12:06:05 -0700
Jason Gunthorpe  wrote:

> Such as:
> 
> In file included from ./include/trace/events/xdp.h:10:0,
>  from ./include/linux/bpf_trace.h:6,
>  from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
> ./include/trace/events/xdp.h:94:17: warning: ‘struct bpf_map’ declared inside parameter list
> const struct bpf_map *map, u32 map_index),
>  ^
> 
> By adding a forward declaration for struct bpf_map. In the
> CONFIG_BPF_SYSCALL case the declaration comes in via
> trace/events/bpf.h

This was already fixed:

 http://lkml.kernel.org/r/1512006089-180279-1-git-send-email-xiexi...@huawei.com

-- Steve

> 
> Fixes: 59a308967589 ("xdp: separate xdp_redirect tracepoint in map case")
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/trace/events/xdp.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index 4cd0f05d01134d..36b2a9043189be 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -9,6 +9,8 @@
>  #include 
>  #include 
>  
> +struct bpf_map;
> +
>  #define __XDP_ACT_MAP(FN)\
>   FN(ABORTED) \
>   FN(DROP)\



Re: [PATCH V11 2/5] vsprintf: refactor %pK code out of pointer()

2017-11-29 Thread Steven Rostedt
On Wed, 29 Nov 2017 15:27:46 +1100
"Tobin C. Harding" <m...@tobin.cc> wrote:

> On Tue, Nov 28, 2017 at 09:39:57PM -0500, Steven Rostedt wrote:
> > On Wed, 29 Nov 2017 13:05:02 +1100
> > "Tobin C. Harding" <m...@tobin.cc> wrote:
> >   
> > > + /*
> > > +  * kptr_restrict==1 cannot be used in IRQ context
> > > +  * because its test for CAP_SYSLOG would be meaningless.
> > > +  */
> > > + if (in_irq() || in_serving_softirq() || in_nmi())  
> > 
> > This could be replaced with:
> > 
> > if (!in_task())
> > 
> > Which is actually more efficient.  
> 
> thanks for the comment Steve. At this late stage in the game do you mind
> if I don't include this change in this set. The code line in question is
> only in the series because of refactoring. I'm comfortable arguing that
> improving efficiency is out of scope ;)

No problem. In fact, I can send this as a separate patch myself, on top
of your series.

-- Steve


Re: [PATCH V11 2/5] vsprintf: refactor %pK code out of pointer()

2017-11-28 Thread Steven Rostedt
On Wed, 29 Nov 2017 13:05:02 +1100
"Tobin C. Harding"  wrote:

> + /*
> +  * kptr_restrict==1 cannot be used in IRQ context
> +  * because its test for CAP_SYSLOG would be meaningless.
> +  */
> + if (in_irq() || in_serving_softirq() || in_nmi())

This could be replaced with:

if (!in_task())

Which is actually more efficient.
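
For context, in_task() folds the three context tests into a single
preempt_count() check, along the lines of its include/linux/preempt.h
definition:

/* True only in plain process context: no NMI, hardirq, or
 * serving-softirq bits set in preempt_count(). */
#define in_task()	(!(preempt_count() & \
			   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))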

-- Steve

> + return string(buf, end, "pK-error", spec);


Re: [PATCH] tcp: Export to userspace the TCP state names for the trace events

2017-11-10 Thread Steven Rostedt
On Sat, 11 Nov 2017 02:06:00 +
Yafang Shao <laoar.s...@gmail.com> wrote:

> 2017-11-10 15:07 GMT+00:00 Steven Rostedt <rost...@goodmis.org>:
> > On Fri, 10 Nov 2017 12:56:06 +0800
> > Yafang Shao <laoar.s...@gmail.com> wrote:
> >  
> >> Could the macro tcp_state_name() be renamed ?
> >> If  is included in include/net/tcp.h, it will  
> >
> > Ideally, you don't want to include trace/events/*.h headers in other
> > headers, as they can have side effects if those headers are included in
> > other trace/events/*.h headers.
> >  
> 
> Actually I find trace/events/*.h is included in lots of other headers,
> for example,
> 
> net/rxrpc/ar-internal.h

This is an internal header, so it's not that likely to be used where it
shouldn't be.

> include/linux/bpf_trace.h
> fs/f2fs/trace.h

The above two are actually headers specifically used to pull in the
trace/events/*.h headers.

> fs/afs/internal.h

another internal header. Unlikely to be misused.

> arch/x86/include/asm/mmu_context.h

This one, hmm, probably should be fixed.

> ...
> 
> Are these files doing properly ?

Most yes, some probably not.

> Should we fix them ?

Probably, but if they are used incorrectly, it would usually fail on
build (The same global functions and variables would be defined).

> 
> But per my understanding, it is ok to include  trace/events/*.h in
> other headers because we defined TRACE_SYSTEM as well, as a
> consequence those headers should not included in trace/events/*.h. If
> that happens, it may means that one of the these two TRACE_SYSTEM is
> not defined properly. Maybe these two TRACE_SYSTEM should be merged to
> one TRACE_SYSTEM.

Two different files may have the same TRACE_SYSTEM defined. That's not
an issue.

The issue is, if you have a trace/events/*.h header in a popular file
(like it used to be in include/linux/slab.h), then it can cause issues
if another trace/events/*.h header includes it. That's because each
trace/events/*.h header must be included with CREATE_TRACE_POINTS only
once.
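
Concretely, the usual pattern is that exactly one .c file per trace
header defines CREATE_TRACE_POINTS before the include, while every
other user includes the header bare (file names illustrative):

/* In exactly one compilation unit, e.g. net/ipv4/tcp.c: */
#define CREATE_TRACE_POINTS
#include <trace/events/tcp.h>

/* In every other file that calls the trace_*() helpers: */
#include <trace/events/tcp.h>

Defining CREATE_TRACE_POINTS around a second include would emit
duplicate definitions and fail at link time.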

> 
> 
> >> cause compile error, because there's another function tcp_state_name()
> >> defined in net/netfilter/ipvs/ip_vs_proto_tcp.c.
> >> static const char * tcp_state_name(int state)
> >> {
> >> 	if (state >= IP_VS_TCP_S_LAST)
> >> 		return "ERR!";
> >> 	return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?";
> >> }
> >
> > But that said, I didn't make up the trace_state_name(), it was already
> > there in net-next before this patch.
> >  
> 
> I know that is not your fault.

:-)

> But as you are modifying this file, it is better to modify it in your
> patch as well.
> So we need not submit another new patch to fix it.

I could whip up a patch 2.

> 
> > But yeah, in actuality, I would have just done:
> >
> > #define EM(a)   { a, #a },
> > #define EMe(a)  { a, #a }
> >
> > directly. Which we can still do.
> >
> > -- Steve
> >  
> 
> The suggestion from Song is good to fix it.

Song's suggestion seems like it can simple be a patch added on top of
mine. As it is somewhat agnostic to the fix I'm making. That is, it's a
different problem, and thus should be a different patch.

-- Steve


Re: [PATCH] tcp: Export to userspace the TCP state names for the trace events

2017-11-10 Thread Steven Rostedt
On Fri, 10 Nov 2017 12:56:06 +0800
Yafang Shao  wrote:

> Could the macro tcp_state_name() be renamed ?
> If  is included in include/net/tcp.h, it will

Ideally, you don't want to include trace/events/*.h headers in other
headers, as they can have side effects if those headers are included in
other trace/events/*.h headers.

> cause compile error, because there's another function tcp_state_name()
> defined in net/netfilter/ipvs/ip_vs_proto_tcp.c.
> static const char * tcp_state_name(int state)
> {
> 	if (state >= IP_VS_TCP_S_LAST)
> 		return "ERR!";
> 	return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?";
> }

But that said, I didn't make up the trace_state_name(), it was already
there in net-next before this patch.

But yeah, in actuality, I would have just done:

#define EM(a)   { a, #a },
#define EMe(a)  { a, #a }

directly. Which we can still do.

-- Steve



[PATCH] tcp: Export to userspace the TCP state names for the trace events

2017-11-09 Thread Steven Rostedt

From: "Steven Rostedt (VMware)" <rost...@goodmis.org>

The TCP trace events (specifically tcp_set_state), maps emums to symbol
names via __print_symbolic(). But this only works for reading trace events
from the tracefs trace files. If perf or trace-cmd were to record these
events, the event format file does not convert the enum names into numbers,
and you get something like:

__print_symbolic(REC->oldstate,
{ TCP_ESTABLISHED, "TCP_ESTABLISHED" },
{ TCP_SYN_SENT, "TCP_SYN_SENT" },
{ TCP_SYN_RECV, "TCP_SYN_RECV" },
{ TCP_FIN_WAIT1, "TCP_FIN_WAIT1" },
{ TCP_FIN_WAIT2, "TCP_FIN_WAIT2" },
{ TCP_TIME_WAIT, "TCP_TIME_WAIT" },
{ TCP_CLOSE, "TCP_CLOSE" },
{ TCP_CLOSE_WAIT, "TCP_CLOSE_WAIT" },
{ TCP_LAST_ACK, "TCP_LAST_ACK" },
{ TCP_LISTEN, "TCP_LISTEN" },
{ TCP_CLOSING, "TCP_CLOSING" },
{ TCP_NEW_SYN_RECV, "TCP_NEW_SYN_RECV" })

Where trace-cmd and perf do not know the values of those enums.

Use the TRACE_DEFINE_ENUM() macros that will have the trace events convert
the enum strings into their values at system boot. This will allow perf and
trace-cmd to see actual numbers and not enums:

__print_symbolic(REC->oldstate,
{ 1, "TCP_ESTABLISHED" },
{ 2, "TCP_SYN_SENT" },
{ 3, "TCP_SYN_RECV" },
{ 4, "TCP_FIN_WAIT1" },
{ 5, "TCP_FIN_WAIT2" },
{ 6, "TCP_TIME_WAIT" },
{ 7, "TCP_CLOSE" },
{ 8, "TCP_CLOSE_WAIT" },
{ 9, "TCP_LAST_ACK" },
{ 10, "TCP_LISTEN" },
{ 11, "TCP_CLOSING" },
{ 12, "TCP_NEW_SYN_RECV" })

Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>
---
 include/trace/events/tcp.h | 41 -
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 07a6cbf1..62e5bad7901f 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -9,21 +9,36 @@
 #include 
 #include 
 
+#define tcp_state_names \
+   EM(TCP_ESTABLISHED) \
+   EM(TCP_SYN_SENT)\
+   EM(TCP_SYN_RECV)\
+   EM(TCP_FIN_WAIT1)   \
+   EM(TCP_FIN_WAIT2)   \
+   EM(TCP_TIME_WAIT)   \
+   EM(TCP_CLOSE)   \
+   EM(TCP_CLOSE_WAIT)  \
+   EM(TCP_LAST_ACK)\
+   EM(TCP_LISTEN)  \
+   EM(TCP_CLOSING) \
+   EMe(TCP_NEW_SYN_RECV)
+
+/* enums need to be exported to user space */
+#undef EM
+#undef EMe
+#define EM(a) TRACE_DEFINE_ENUM(a);
+#define EMe(a) TRACE_DEFINE_ENUM(a);
+
+tcp_state_names
+
+#undef EM
+#undef EMe
+#define EM(a) tcp_state_name(a),
+#define EMe(a) tcp_state_name(a)
+
 #define tcp_state_name(state)  { state, #state }
 #define show_tcp_state_name(val)   \
-   __print_symbolic(val,   \
-   tcp_state_name(TCP_ESTABLISHED),\
-   tcp_state_name(TCP_SYN_SENT),   \
-   tcp_state_name(TCP_SYN_RECV),   \
-   tcp_state_name(TCP_FIN_WAIT1),  \
-   tcp_state_name(TCP_FIN_WAIT2),  \
-   tcp_state_name(TCP_TIME_WAIT),  \
-   tcp_state_name(TCP_CLOSE),  \
-   tcp_state_name(TCP_CLOSE_WAIT), \
-   tcp_state_name(TCP_LAST_ACK),   \
-   tcp_state_name(TCP_LISTEN), \
-   tcp_state_name(TCP_CLOSING),\
-   tcp_state_name(TCP_NEW_SYN_RECV))
+   __print_symbolic(val, tcp_state_names)
 
 /*
  * tcp event with arguments sk and skb
-- 
2.13.6



Re: [PATCH] net/tcp: introduce TRACE_EVENT for TCP/IPv4 state transition

2017-11-09 Thread Steven Rostedt
On Thu, 9 Nov 2017 23:40:13 +
Song Liu  wrote:

> > tcp_set_state uses __print_symbolic to show state in text format. I found
> > trace-cmd cannot parse that part:
> > 
> > [011] 147338.660560: tcp_set_state:sport=16262 dport=48346 \
> >saddr=127.0.0.6 daddr=127.0.0.6 saddrv6=2401:db00:30:317e:face:0:1f:0 \
> >daddrv6=2401:db00:30:31e5:face:0:7f:0 oldstate= newstate=

The latest trace-cmd does show oldstate=0xa newstate=0x7, since I fixed
it so undefined symbols and flags are displayed.

> > 
> > Other parts of the output looks good to me.
> > 
> > Thanks,
> > Song  
> 
> I am not sure whether this is the best approach, but the following patch 
> fixes the output of perf:

No it's not the best approach. But the below patch is ;-)

> 
>  0.44%  sport=16262 dport=39362 saddr=127.0.0.6 daddr=127.0.0.6 \
> saddrv6=2401:db00:30:317e:face:0:1f:0 daddrv6=2401:db00:30:31e5:face:0:7f:0 \
> oldstate=TCP_CLOSE_WAIT newstate=TCP_LAST_ACK
> 

I'll send a formal patch if you all approve.

-- Steve

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 07a6cbf1..62e5bad7901f 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -9,21 +9,36 @@
 #include 
 #include 
 
+#define tcp_state_names \
+   EM(TCP_ESTABLISHED) \
+   EM(TCP_SYN_SENT)\
+   EM(TCP_SYN_RECV)\
+   EM(TCP_FIN_WAIT1)   \
+   EM(TCP_FIN_WAIT2)   \
+   EM(TCP_TIME_WAIT)   \
+   EM(TCP_CLOSE)   \
+   EM(TCP_CLOSE_WAIT)  \
+   EM(TCP_LAST_ACK)\
+   EM(TCP_LISTEN)  \
+   EM(TCP_CLOSING) \
+   EMe(TCP_NEW_SYN_RECV)
+
+/* enums need to be exported to user space */
+#undef EM
+#undef EMe
+#define EM(a) TRACE_DEFINE_ENUM(a);
+#define EMe(a) TRACE_DEFINE_ENUM(a);
+
+tcp_state_names
+
+#undef EM
+#undef EMe
+#define EM(a) tcp_state_name(a),
+#define EMe(a) tcp_state_name(a)
+
 #define tcp_state_name(state)  { state, #state }
 #define show_tcp_state_name(val)   \
-   __print_symbolic(val,   \
-   tcp_state_name(TCP_ESTABLISHED),\
-   tcp_state_name(TCP_SYN_SENT),   \
-   tcp_state_name(TCP_SYN_RECV),   \
-   tcp_state_name(TCP_FIN_WAIT1),  \
-   tcp_state_name(TCP_FIN_WAIT2),  \
-   tcp_state_name(TCP_TIME_WAIT),  \
-   tcp_state_name(TCP_CLOSE),  \
-   tcp_state_name(TCP_CLOSE_WAIT), \
-   tcp_state_name(TCP_LAST_ACK),   \
-   tcp_state_name(TCP_LISTEN), \
-   tcp_state_name(TCP_CLOSING),\
-   tcp_state_name(TCP_NEW_SYN_RECV))
+   __print_symbolic(val, tcp_state_names)
 
 /*
  * tcp event with arguments sk and skb


Re: [PATCH] net/tcp: introduce TRACE_EVENT for TCP/IPv4 state transition

2017-11-09 Thread Steven Rostedt
On Thu, 9 Nov 2017 15:43:29 +0900
Alexei Starovoitov  wrote:

> > +TRACE_EVENT(tcp_set_state,
> > +   TP_PROTO(struct sock *sk, int oldstate, int newstate),
> > +   TP_ARGS(sk, oldstate, newstate),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__be32, dst)
> > +   __field(__be32, src)
> > +   __field(__u16, dport)
> > +   __field(__u16, sport)
> > +   __field(int, oldstate)
> > +   __field(int, newstate)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   if (oldstate == TCP_TIME_WAIT) {
> > +   __entry->dst = inet_twsk(sk)->tw_daddr;
> > +   __entry->src = inet_twsk(sk)->tw_rcv_saddr;
> > +   __entry->dport = ntohs(inet_twsk(sk)->tw_dport);
> > +   __entry->sport = ntohs(inet_twsk(sk)->tw_sport);
> > +   } else if (oldstate == TCP_NEW_SYN_RECV) {
> > +   __entry->dst = inet_rsk(inet_reqsk(sk))->ir_rmt_addr;
> > +   __entry->src = inet_rsk(inet_reqsk(sk))->ir_loc_addr;
> > +   __entry->dport =
> > +   ntohs(inet_rsk(inet_reqsk(sk))->ir_rmt_port);
> > +   __entry->sport = inet_rsk(inet_reqsk(sk))->ir_num;
> > +   } else {
> > +   __entry->dst = inet_sk(sk)->inet_daddr;
> > +   __entry->src = inet_sk(sk)->inet_rcv_saddr;
> > +   __entry->dport = ntohs(inet_sk(sk)->inet_dport);
> > +   __entry->sport = ntohs(inet_sk(sk)->inet_sport);
> > +   }
> > +
> > +   __entry->oldstate = oldstate;
> > +   __entry->newstate = newstate;
> > +   ),
> > +
> > +   TP_printk("%08X:%04X %08X:%04X, %02x %02x",
> > +   __entry->src, __entry->sport, __entry->dst, __entry->dport,
> > +   __entry->oldstate, __entry->newstate)  
> 
> direct %x of state is not allowed.
> This has to use show_tcp_state_name() like it's done in trace_tcp_set_state

Hmm, I need to look at trace_tcp_set_state. I'm guessing it is in
net-next too?

> 
> Also I'm missing the reason to introduce another tracepoint
> that looks just like trace_tcp_set_state.

I want to make sure that perf and trace-cmd can parse the TP_printk(),
if it is having helper functions like that in the TP_printk() part,
then the libtraceevent needs to be aware of that.
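
For illustration, compare a TP_printk() that calls an arbitrary kernel
helper with one built from __print_symbolic(). The first is opaque to
userspace parsers, since they cannot execute kernel code; the second is a
construct libtraceevent knows how to interpret (my_custom_helper() is a
made-up name):

	/* Opaque to perf/trace-cmd: they cannot call my_custom_helper() */
	TP_printk("state=%s", my_custom_helper(__entry->state))

	/* Parseable: __print_symbolic() is understood by libtraceevent */
	TP_printk("state=%s", show_tcp_state_name(__entry->state))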

-- Steve


Re: [PATCH] kallsyms: don't leak address when printing symbol

2017-11-09 Thread Steven Rostedt
On Thu, 9 Nov 2017 16:45:48 +1100
"Tobin C. Harding" <m...@tobin.cc> wrote:

> On Wed, Nov 08, 2017 at 10:35:55PM -0500, Steven Rostedt wrote:
> > On Thu,  9 Nov 2017 12:50:29 +1100
> > "Tobin C. Harding" <m...@tobin.cc> wrote:
> >   
> > > Currently if a pointer is printed using %p[sSB] and the symbol is not
> > > found (kallsyms_lookup() fails) then we print the actual address. This
> > > leaks kernel addresses. We should instead print something _safe_.
> > > 
> > > Print "" instead of kernel address.  
> > 
> > Ug, ftrace requires this to work as is, as it uses it to print some
> > addresses that may or may not be a symbol.
> > 
> > If anything, can this return a success or failure if it were to find a
> > symbol or not, and then something like ftrace could decide to use %x if
> > it does not.  
> 
> Thanks for the feed back Steve. Propagating the error back up through
> the call chain may require a little bit of thought so we don't upset the
> apple cart. sprint_symbol() never currently (as far as I can see)
> returns an error. I can go through the other callers of sprint_symbol()
> (there aren't many) and check if it is going to upset anything.

As the functions are currently void, adding a return value if it
doesn't print the symbol shouldn't break anything, unless the other
users required an address as well. Which stack traces might!
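
A minimal sketch of that direction (the function name and error convention
here are made up, not existing kernel API): let the lookup report failure
and let ftrace fall back to printing the raw address itself:

	/* Hypothetical variant: returns length on success, -1 if no symbol */
	int sprint_symbol_checked(char *buffer, unsigned long address);

	static void ftrace_print_addr(struct trace_seq *s, unsigned long addr)
	{
		char buf[KSYM_SYMBOL_LEN];

		/* ftrace still wants the raw address when there is no symbol */
		if (sprint_symbol_checked(buf, addr) < 0)
			trace_seq_printf(s, "0x%lx", addr);
		else
			trace_seq_printf(s, "%s", buf);
	}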

> 
> > And yes, ftrace leaks kernel addresses all over the place, that's just
> > the nature of tracing the kernel.  
> 
> Would it be good for you (for this change and future changes aimed at
> closing leaks) if any changes like this include patches to ftrace to
> maintain the current behaviour?
> 
> You have been on the CC list for the printk hashing and what not since
> the start I believe so you know I'm a noob, feel free to scream bloody
> murder if I'm breaching protocol.

No, you have been fine. I've been quickly scanning your patches looking
for things like this. Please keep Ccing me. I'm currently busy fighting
other fires, but I try to at least take a quick look at patches like
this. I may not reply, but you are on my radar ;-)

-- Steve


Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-09 Thread Steven Rostedt
On Thu, 9 Nov 2017 10:24:32 +0530
Kaiwan N Billimoria  wrote:

> On Thu, Nov 9, 2017 at 10:13 AM, Kaiwan N Billimoria
>  wrote:
> >> But I don't know if there is anything else than the profiling code
> >> that _really_ wants access to /proc/kallsyms in user space as a
> >> regular user.  
> >  
> 
> Front-ends to ftrace, like trace-cmd?
> [from the trace-cmd git repo (Steve Rostedt, pl stand up, pl stand up :-)
> Documentation/trace-cmd-restore.1.txt :

Yes, profiling and tracing are similar, and you need to be root to run the
recording anyway. Thus, as long as the root user can read kallsyms,
trace-cmd should be fine, since trace-cmd requires root access to do any
ftrace tracing.

-- Steve



Re: [PATCH] kallsyms: don't leak address when printing symbol

2017-11-08 Thread Steven Rostedt
On Thu,  9 Nov 2017 12:50:29 +1100
"Tobin C. Harding"  wrote:

> Currently if a pointer is printed using %p[sSB] and the symbol is not
> found (kallsyms_lookup() fails) then we print the actual address. This
> leaks kernel addresses. We should instead print something _safe_.
> 
> Print "" instead of kernel address.

Ug, ftrace requires this to work as is, as it uses it to print some
addresses that may or may not be a symbol.

If anything, can this return a success or failure if it were to find a
symbol or not, and then something like ftrace could decide to use %x if
it does not.

And yes, ftrace leaks kernel addresses all over the place, that's just
the nature of tracing the kernel.

-- Steve


> 
> Signed-off-by: Tobin C. Harding 
> ---
>  kernel/kallsyms.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index 127e7cfafa55..182e7592be9c 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -390,7 +390,7 @@ static int __sprint_symbol(char *buffer, unsigned long 
> address,
>   address += symbol_offset;
>   name = kallsyms_lookup(address, &size, &offset, &modname, buffer);
>   if (!name)
> - return sprintf(buffer, "0x%lx", address - symbol_offset);
> + return sprintf(buffer, "");
>  
>   if (name != buffer)
>   strcpy(buffer, name);



Re: [PATCH v3] scripts: add leaking_addresses.pl

2017-11-07 Thread Steven Rostedt
On Tue, 7 Nov 2017 13:44:01 -0800
Linus Torvalds  wrote:

> > Looking other places that stand out, it seems like
> > /proc/lockdep_chains and /proc/lockdep (CONFIG_LOCKDEP=y) has a ton of
> > %p usage. It's unclear to me if a hash is sufficient for meaningful
> > debugging there?  
> 
> Maybe not, but that is also _so_ esoteric that I suspect the right fix
> is to just make it root-only readable.

Also note, I don't believe anyone should be running a LOCKDEP-configured
kernel in a production (secured) environment, as it adds quite a bit of
overhead. It's something you run in test environments to make sure it
doesn't detect any possible deadlocks.

> 
> I've never used it, we should check with people who have. I get the
> feeling that this is purely for PeterZ debugging.

I've used it. But then again, I also debug lockdep ;-)

> 
> The very first commit that introduced that code actually has a
> 
> (FIXME: should go into debugfs)
> 
> so I suspect it never should have been user-readable to begin with. I
> guess it makes some things easier, but it really is *very* different
> from things like profiling.

Want me to whip up a patch to move the file?

-- Steve

> 
> Profiling you often *cannot* do as root - some things you profile
> really shouldn't be run as root, and might even refuse to do so. So
> requiring you to be root just to get a kernel profile is very bad.
> 
> But looking at lockdep stats? Yeah, 'sudo' isn't so big of a deal.
> 
> 


Re: [PATCH net-next] ipv6: let trace_fib6_table_lookup() dereference the fib table

2017-10-19 Thread Steven Rostedt
On Thu, 19 Oct 2017 09:31:43 +0200
Paolo Abeni <pab...@redhat.com> wrote:

> The perf traces for ipv6 routing code show a relevant cost around
> trace_fib6_table_lookup(), even if no trace is enabled. This is
> due to the fib6_table de-referencing currently performed by the
> caller.
> 
> Let's the tracing code pay this overhead, passing to the trace
> helper the table pointer. This gives small but measurable
> performance improvement under UDP flood.
> 
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---

Acked-by: Steven Rostedt (VMware) <rost...@goodmis.org>

-- Steve


Re: [PATCH] tracing: bpf: Hide bpf trace events when they are not used

2017-10-16 Thread Steven Rostedt
On Mon, 16 Oct 2017 21:11:06 +0100 (WEST)
David Miller <da...@davemloft.net> wrote:

> From: Steven Rostedt <rost...@goodmis.org>
> Date: Mon, 16 Oct 2017 16:01:25 -0400
> 
> > On Mon, 16 Oct 2017 20:54:34 +0100 (WEST)
> > David Miller <da...@davemloft.net> wrote:
> >   
> >> Steven, I lost track of how this patch is being handled.
> >> 
> >> Do you want me to merge it via my net-next tree?  
> > 
> > If you want. I could take it too if you give me an ack. It's not
> > dependent on any other changes so it doesn't matter which way it goes. I
> > know Alexei was thinking about doing the same for xdp but those appear
> > to be used even without BPF_SYSCALLS.  
> 
> Ok, applied to my net-next tree and if you want to apply it to your's
> too, here is the ACK:
> 
> Acked-by: David S. Miller <da...@davemloft.net>

Thanks! But if you are taking it, I'm fine with that. It may be a while
before I get my tool applied that detects unused tracepoints at compile
time. I have it working at runtime, but I'd rather have a compiler warning.

-- Steve


Re: [PATCH] tracing: bpf: Hide bpf trace events when they are not used

2017-10-16 Thread Steven Rostedt
On Mon, 16 Oct 2017 20:54:34 +0100 (WEST)
David Miller  wrote:

> Steven, I lost track of how this patch is being handled.
> 
> Do you want me to merge it via my net-next tree?

If you want. I could take it too if you give me an ack. It's not
dependent on any other changes so it doesn't matter which way it goes. I
know Alexei was thinking about doing the same for xdp but those appear
to be used even without BPF_SYSCALLS.

-- Steve



Re: [PATCH] tracing: bpf: Hide bpf trace events when they are not used

2017-10-12 Thread Steven Rostedt
On Thu, 12 Oct 2017 18:38:36 -0700
Alexei Starovoitov  wrote:

> actually just noticed that xdp tracepoints are not covered by ifdef.
> They depend on bpf_syscall too. So probably makes sense to wrap
> them together.
> bpf tracepoints are not being actively worked on whereas xdp tracepoints
> keep evolving quickly, so the best is probalby to go via net-next
> if you don't mind.

Hmm, they didn't trigger a warning, with the exception of
trace_xdp_redirect_map. I have code to check if tracepoints are used or
not, and it appears that the xdp tracepoints can be used without BPF_SYSCALL.

I don't think they should be wrapped together until we know why they
are used. I can still take this patch and just not touch the xdp ones.

Note, my kernel was using trace_xdp_redirect_map_err,
trace_xdp_redirect_err, trace_xdp_redirect and trace_xdp_exception, as
they did appear.

-- Steve


Re: [PATCH] tracing: bpf: Hide bpf trace events when they are not used

2017-10-12 Thread Steven Rostedt
On Thu, 12 Oct 2017 18:14:52 -0700
Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:

> On Thu, Oct 12, 2017 at 06:40:02PM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt (VMware) <rost...@goodmis.org>
> > 
> > All the trace events defined in include/trace/events/bpf.h are only
> > used when CONFIG_BPF_SYSCALL is defined. But this file gets included by
> > include/linux/bpf_trace.h which is included by the networking code with
> > CREATE_TRACE_POINTS defined.
> > 
> > If a trace event is created but not used it still has data structures
> > and functions created for its use, even though nothing is using them.
> > To not waste space, do not define the BPF trace events in bpf.h unless
> > CONFIG_BPF_SYSCALL is defined.
> > 
> > Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>  
> 
> Looks fine.
> Acked-by: Alexei Starovoitov <a...@kernel.org>
> 
> I'm assuming you want to take it through tracing tree along
> with all other cleanups?

Either way is fine. I have a few other ones. I believe Paul is taking
the RCU patch. There's no dependency.

I'll take it if it is easier for you. I just need the ack.

-- Steve


[PATCH] tracing: bpf: Hide bpf trace events when they are not used

2017-10-12 Thread Steven Rostedt
From: Steven Rostedt (VMware) <rost...@goodmis.org>

All the trace events defined in include/trace/events/bpf.h are only
used when CONFIG_BPF_SYSCALL is defined. But this file gets included by
include/linux/bpf_trace.h which is included by the networking code with
CREATE_TRACE_POINTS defined.

If a trace event is created but not used it still has data structures
and functions created for its use, even though nothing is using them.
To not waste space, do not define the BPF trace events in bpf.h unless
CONFIG_BPF_SYSCALL is defined.

Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>
---
Index: linux-trace.git/include/trace/events/bpf.h
===
--- linux-trace.git.orig/include/trace/events/bpf.h
+++ linux-trace.git/include/trace/events/bpf.h
@@ -4,6 +4,9 @@
 #if !defined(_TRACE_BPF_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_BPF_H
 
+/* These are only used within the BPF_SYSCALL code */
+#ifdef CONFIG_BPF_SYSCALL
+
 #include 
 #include 
 #include 
@@ -345,7 +348,7 @@ TRACE_EVENT(bpf_map_next_key,
  __print_hex(__get_dynamic_array(nxt), __entry->key_len),
  __entry->key_trunc ? " ..." : "")
 );
-
+#endif /* CONFIG_BPF_SYSCALL */
 #endif /* _TRACE_BPF_H */
 
 #include 
Index: linux-trace.git/kernel/bpf/core.c
===
--- linux-trace.git.orig/kernel/bpf/core.c
+++ linux-trace.git/kernel/bpf/core.c
@@ -1498,5 +1498,8 @@ int __weak skb_copy_bits(const struct sk
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
 
+/* These are only used within the BPF_SYSCALL code */
+#ifdef CONFIG_BPF_SYSCALL
 EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_get_type);
 EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_put_rcu);
+#endif


Re: [PATCH 1/4] rcu: introduce noref debug

2017-10-06 Thread Steven Rostedt
On Fri,  6 Oct 2017 14:57:46 +0200
Paolo Abeni  wrote:

> We currently lack a debugging infrastructure to ensure that
> long-lived noref dst are properly handled - namely dropped
> or converted to a refcounted version before escaping the current
> RCU section.
> 
> This changeset implements such infra, tracking the noref pointer
> on a per CPU store and checking that all noref tracked at any
> given RCU nesting level are cleared before leaving such RCU
> section.
> 
> Each noref entity is identified using the entity pointer and an
> additional, optional, key/opaque pointer. This is needed to cope
> with usage case scenario where the same noref entity is stored
> in different places (e.g. noref dst on skb_clone()).
> 
> To keep the patch/implementation simple RCU_NOREF_DEBUG depends
> on PREEMPT_RCU=n in Kconfig.
> 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/rcupdate.h | 11 ++
>  kernel/rcu/Kconfig.debug | 15 
>  kernel/rcu/Makefile  |  1 +
>  kernel/rcu/noref_debug.c | 89
>  4 files changed, 116 insertions(+)
>  create mode 100644 kernel/rcu/noref_debug.c
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index de50d8a4cf41..20c1ce08e3eb 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -60,6 +60,15 @@ void call_rcu_sched(struct rcu_head *head, rcu_callback_t func);
>  void synchronize_sched(void);
>  void rcu_barrier_tasks(void);
>  
> +#ifdef CONFIG_RCU_NOREF_DEBUG
> +void rcu_track_noref(void *key, void *noref, bool track);
> +void __rcu_check_noref(void);
> +
> +#else
> +static inline void rcu_track_noref(void *key, void *noref, bool add) { }
> +static inline void __rcu_check_noref(void) { }
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  void __rcu_read_lock(void);
> @@ -85,6 +94,7 @@ static inline void __rcu_read_lock(void)
>  
>  static inline void __rcu_read_unlock(void)
>  {
> + __rcu_check_noref();
>   if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
>   preempt_enable();
>  }
> @@ -723,6 +733,7 @@ static inline void rcu_read_unlock_bh(void)
>"rcu_read_unlock_bh() used illegally while idle");
>   rcu_lock_release(&rcu_bh_lock_map);
>   __release(RCU_BH);
> + __rcu_check_noref();
>   local_bh_enable();
>  }
>  
> diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> index 0ec7d1d33a14..6c7f52a3e809 100644
> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -68,6 +68,21 @@ config RCU_TRACE
> Say Y here if you want to enable RCU tracing
> Say N if you are unsure.
>  
> +config RCU_NOREF_DEBUG
> + bool "Debugging for RCU-protected noref entries"
> + depends on PREEMPT_RCU=n
> + select PREEMPT_COUNT
> + default n
> + help
> +   In case of long lasting rcu_read_lock sections this debug
> +   feature enables one to ensure that no rcu managed dereferenced
> +   data leaves the locked section. One use case is the tracking
> +   of dst_entries in struct sk_buff ->dst, which is used to pass
> +   the dst_entry around in the whole stack while under RCU lock.
> +
> +   Say Y here if you want to enable noref RCU debugging
> +   Say N if you are unsure.
> +
>  config RCU_EQS_DEBUG
>   bool "Provide debugging asserts for adding NO_HZ support to an arch"
>   depends on DEBUG_KERNEL
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 13c0fc852767..c67d7c65c582 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -11,3 +11,4 @@ obj-$(CONFIG_TREE_RCU) += tree.o
>  obj-$(CONFIG_PREEMPT_RCU) += tree.o
>  obj-$(CONFIG_TINY_RCU) += tiny.o
>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> +obj-$(CONFIG_RCU_NOREF_DEBUG) += noref_debug.o
> \ No newline at end of file
> diff --git a/kernel/rcu/noref_debug.c b/kernel/rcu/noref_debug.c
> new file mode 100644
> index ..ae2e104b11d6
> --- /dev/null
> +++ b/kernel/rcu/noref_debug.c
> @@ -0,0 +1,89 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define NOREF_MAX 7
> +struct noref_entry {
> + void *noref;
> + void *key;
> + int nesting;
> +};
> +
> +struct noref_cache {
> + struct noref_entry store[NOREF_MAX];
> +};
> +
> +static DEFINE_PER_CPU(struct noref_cache, per_cpu_noref_cache);
> +
> +static int noref_cache_lookup(struct noref_cache *cache, void *noref, void 
> *key)
> +{
> + int i;
> +
> + for (i = 0; i < NOREF_MAX; ++i)
> + if (cache->store[i].noref == noref &&
> + cache->store[i].key == key)
> + return i;
> +
> + return -1;
> +}
> +

Please add a comment above this function on how to use it.
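
Something along these lines would do (the wording is only a suggestion):

	/*
	 * noref_cache_lookup - find the slot tracking a noref/key pair
	 * @cache: per-CPU noref cache to search
	 * @noref: the noref entity being tracked
	 * @key: optional opaque key the entity was registered with
	 *
	 * Returns the index of the matching slot, or -1 if the pair is
	 * not currently tracked.
	 */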

-- Steve

> +void rcu_track_noref(void *key, void *noref, bool track)
> +{
> + struct noref_cache *cache = this_cpu_ptr(&per_cpu_noref_cache);
> + int i;
> +
> + if (track) {
> + /* find the first empty slot */
> 

Re: [PATCH net] bpf: one perf event close won't free bpf program attached by another perf event

2017-09-21 Thread Steven Rostedt
On Thu, 21 Sep 2017 13:17:06 +0200
Peter Zijlstra  wrote:

> I suspect that would break a fair bunch of bpf proglets, since the data
> access to the trace data would be completely different, but it would be
> much nicer to not have this distinction based on event type.

Maybe this would be a good test to see if bpf is considered an ABI or
not.

-- Steve


Re: [PATCH net] bpf: one perf event close won't free bpf program attached by another perf event

2017-09-20 Thread Steven Rostedt
On Mon, 18 Sep 2017 16:38:36 -0700
Yonghong Song  wrote:

> This patch fixes a bug exhibited by the following scenario:
>   1. fd1 = perf_event_open with attr.config = ID1
>   2. attach bpf program prog1 to fd1
>   3. fd2 = perf_event_open with attr.config = ID1
>  
>   4. user program closes fd2 and prog1 is detached from the tracepoint.
>   5. user program with fd1 does not work properly as the tracepoint
>  no longer produces any output.
> 
> The issue happens at step 4. Multiple perf_event_open calls can succeed,
> but there is only one bpf prog pointer in the tp_event. In the
> current logic, any fd release for the same tp_event will free
> the tp_event->prog.
> 
> The fix is to free tp_event->prog only when the closing fd
> corresponds to the one which registered the program.
> 
> Signed-off-by: Yonghong Song 
> ---
>  Additional context: discussed with Alexei internally but did not find
>  a solution which can avoid introducing the additional field in
>  trace_event_call structure.
> 
>  Peter, could you take a look as well and maybe you could have better
>  alternative? Thanks!
> 
>  include/linux/trace_events.h | 1 +
>  kernel/events/core.c | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 7f11050..2e0f222 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -272,6 +272,7 @@ struct trace_event_call {
>   int perf_refcount;
>   struct hlist_head __percpu  *perf_events;
>   struct bpf_prog *prog;
> + struct perf_event   *bpf_prog_owner;

Does this have to be in the trace_event_call structure? Hmm, I'm
wondering if the prog needs to be there at all (I should look into whether
we can move it out). The trace_event_call is created for *every* event,
and there are thousands of them now. Every byte added to this structure
adds thousands of bytes to the kernel. Would it be possible to attach the
prog and the owner to the perf_event?
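
Roughly this (a sketch only; the tp_prog field is invented here to
illustrate the idea, it is not existing kernel code):

	struct perf_event {
		/* ... existing fields ... */
		struct bpf_prog	*tp_prog; /* program attached to this event */
	};

	static void perf_event_free_bpf_prog(struct perf_event *event)
	{
		struct bpf_prog *prog = event->tp_prog;

		if (prog) {
			event->tp_prog = NULL;
			bpf_prog_put(prog);
		}
	}

That way each event owns its own program, and no owner field is needed in
the (very numerous) trace_event_call structures.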

-- Steve


>  
>   int (*perf_perm)(struct trace_event_call *,
>struct perf_event *);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..6bc21e2 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8171,6 +8171,7 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
>   }
>   }
>   event->tp_event->prog = prog;
> + event->tp_event->bpf_prog_owner = event;
>  
>   return 0;
>  }
> @@ -8185,7 +8186,7 @@ static void perf_event_free_bpf_prog(struct perf_event *event)
>   return;
>  
>   prog = event->tp_event->prog;
> - if (prog) {
> + if (prog && event->tp_event->bpf_prog_owner == event) {
>   event->tp_event->prog = NULL;
>   bpf_prog_put(prog);
>   }



Re: [PATCH 02/12] trace: Make trace_hwlat timestamp y2038 safe

2017-04-07 Thread Steven Rostedt
On Fri,  7 Apr 2017 17:57:00 -0700
Deepa Dinamani <deepa.ker...@gmail.com> wrote:

> struct timespec is not y2038 safe on 32 bit machines
> and needs to be replaced by struct timespec64
> in order to represent times beyond year 2038 on such
> machines.
> 
> Fix all the timestamp representation in struct trace_hwlat
> and all the corresponding implementations.
> 

> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 02a4aeb..08f9bab 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -4,7 +4,6 @@
>   * Copyright (C) 2008 Red Hat Inc, Steven Rostedt <srost...@redhat.com>
>   *
>   */
> -
>  #include 
>  #include 
>  #include 
> @@ -1161,11 +1160,11 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
>  
>   trace_assign_type(field, entry);
>  
> - trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%ld.%09ld",
> + trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld",
>field->seqnum,
>field->duration,
>field->outer_duration,
> -  field->timestamp.tv_sec,
> +  (long long)field->timestamp.tv_sec,

Refresh my memory. We need the cast because on 64 bit boxes
timestamp.tv_sec is just a long?
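
For reference, the definitions of that era look roughly like this
(paraphrased from linux/time64.h):

	#if __BITS_PER_LONG == 64
	/* timespec64 is just timespec; tv_sec is a time_t, i.e. a long */
	# define timespec64 timespec
	#else
	struct timespec64 {
		time64_t	tv_sec;		/* long long */
		long		tv_nsec;
	};
	#endif

So on 64-bit boxes tv_sec is a plain long, and the cast is needed for it
to match the %lld format on all architectures.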

Other than that.

Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>

-- Steve

>field->timestamp.tv_nsec);
>  
>   if (field->nmi_count) {
> @@ -1195,10 +1194,10 @@ trace_hwlat_raw(struct trace_iterator *iter, int flags,
>  
>   trace_assign_type(field, iter->ent);
>  
> - trace_seq_printf(s, "%llu %lld %ld %09ld %u\n",
> + trace_seq_printf(s, "%llu %lld %lld %09ld %u\n",
>field->duration,
>field->outer_duration,
> -  field->timestamp.tv_sec,
> +  (long long)field->timestamp.tv_sec,
>field->timestamp.tv_nsec,
>field->seqnum);
>  



Re: [PATCH v3] tracing/kprobes: expose maxactive for kretprobe in kprobe_events

2017-04-04 Thread Steven Rostedt
On Tue, 4 Apr 2017 20:24:59 +0900
Masami Hiramatsu  wrote:

> On Mon,  3 Apr 2017 12:36:22 +0200
> Alban Crequy  wrote:
> 
> > From: Alban Crequy 
> > 
> > When a kretprobe is installed on a kernel function, there is a maximum
> > limit of how many calls in parallel it can catch (aka "maxactive"). A
> > kernel module could call register_kretprobe() and initialize maxactive
> > (see example in samples/kprobes/kretprobe_example.c).
> > 
> > But that is not exposed to userspace and it is currently not possible to
> > choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events
> > 
> > The default maxactive can be as low as 1 on single-core with a
> > non-preemptive kernel. This is too low and we need to increase it not
> > only for recursive functions, but for functions that sleep or resched.
> > 
> > This patch updates the format of the command that can be written to
> > kprobe_events so that maxactive can be optionally specified.
> > 
> > I need this for a bpf program attached to the kretprobe of
> > inet_csk_accept, which can sleep for a long time.
> > 
> > This patch includes a basic selftest:
> >   
> > > # ./ftracetest -v  test.d/kprobe/
> > > === Ftrace unit tests ===
> > > [1] Kprobe dynamic event - adding and removing[PASS]
> > > [2] Kprobe dynamic event - busy event check   [PASS]
> > > [3] Kprobe dynamic event with arguments   [PASS]
> > > [4] Kprobes event arguments with types[PASS]
> > > [5] Kprobe dynamic event with function tracer [PASS]
> > > [6] Kretprobe dynamic event with arguments[PASS]
> > > [7] Kretprobe dynamic event with maxactive[PASS]
> > >
> > > # of passed:  7
> > > # of failed:  0
> > > # of unresolved:  0
> > > # of untested:  0
> > > # of unsupported:  0
> > > # of xfailed:  0
> > > # of undefined(test bug):  0  
> > 
> > BugLink: https://github.com/iovisor/bcc/issues/1072
> > Signed-off-by: Alban Crequy   
> 
> Looks good to me.
> 
> Acked-by: Masami Hiramatsu 
> 

Applied, thanks!

-- Steve


Re: [PATCH v2] tracing/kprobes: expose maxactive for kretprobe in kprobe_events

2017-03-31 Thread Steven Rostedt
On Fri, 31 Mar 2017 15:20:24 +0200
Alban Crequy <alban.cre...@gmail.com> wrote:

> When a kretprobe is installed on a kernel function, there is a maximum
> limit of how many calls in parallel it can catch (aka "maxactive"). A
> kernel module could call register_kretprobe() and initialize maxactive
> (see example in samples/kprobes/kretprobe_example.c).
> 
> But that is not exposed to userspace and it is currently not possible to
> choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events
> 
> The default maxactive can be as low as 1 on single-core with a
> non-preemptive kernel. This is too low and we need to increase it not
> only for recursive functions, but for functions that sleep or resched.
> 
> This patch updates the format of the command that can be written to
> kprobe_events so that maxactive can be optionally specified.
> 
> I need this for a bpf program attached to the kretprobe of
> inet_csk_accept, which can sleep for a long time.
> 
> This patch includes a basic selftest:
> 
> > # ./ftracetest -v  test.d/kprobe/
> > === Ftrace unit tests ===
> > [1] Kprobe dynamic event - adding and removing  [PASS]
> > [2] Kprobe dynamic event - busy event check [PASS]
> > [3] Kprobe dynamic event with arguments [PASS]
> > [4] Kprobes event arguments with types  [PASS]
> > [5] Kprobe dynamic event with function tracer   [PASS]
> > [6] Kretprobe dynamic event with arguments  [PASS]
> > [7] Kretprobe dynamic event with maxactive  [PASS]
> >
> > # of passed:  7
> > # of failed:  0
> > # of unresolved:  0
> > # of untested:  0
> > # of unsupported:  0
> > # of xfailed:  0
> > # of undefined(test bug):  0  
> 
> BugLink: https://github.com/iovisor/bcc/issues/1072
> Signed-off-by: Alban Crequy <al...@kinvolk.io>
> 
> ---
> 
> Changes since v1:
> - Remove "(*)" from documentation. (Review from Masami Hiramatsu)
> - Fix support for "r100" without the event name (Review from Masami Hiramatsu)
> - Get rid of magic numbers within the code.  (Review from Steven Rostedt)
>   Note that I didn't use KRETPROBE_MAXACTIVE_ALLOC since that patch is not
>   merged.
> - Return -E2BIG when maxactive is too big.
> - Add basic selftest
> ---
>  Documentation/trace/kprobetrace.txt|  4 ++-
>  kernel/trace/trace_kprobe.c| 39 ++
>  .../ftrace/test.d/kprobe/kretprobe_maxactive.tc| 39 ++
>  3 files changed, 75 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
> 
> diff --git a/Documentation/trace/kprobetrace.txt 
> b/Documentation/trace/kprobetrace.txt
> index 41ef9d8..7051a20 100644
> --- a/Documentation/trace/kprobetrace.txt
> +++ b/Documentation/trace/kprobetrace.txt
> @@ -23,7 +23,7 @@ current_tracer. Instead of that, add probe points via
>  Synopsis of kprobe_events
>  -
>p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]   : Set a probe
> -  r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]  : Set a return probe
> +  r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]   : Set a return 
> probe
>-:[GRP/]EVENT  : Clear a probe
>  
>   GRP : Group name. If omitted, use "kprobes" for it.
> @@ -32,6 +32,8 @@ Synopsis of kprobe_events
>   MOD : Module name which has given SYM.
>   SYM[+offs]  : Symbol+offset where the probe is inserted.
>   MEMADDR : Address where the probe is inserted.
> + MAXACTIVE   : Maximum number of instances of the specified function that
> +   can be probed simultaneously, or 0 for the default.

BTW, to me, 0 means none (no instances can probe). This should have a
better description of what "0" actually means.
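
For context, "the default" here is roughly this logic in
register_kretprobe() (paraphrased from kernel/kprobes.c of that era):

	if (rp->maxactive <= 0) {
	#ifdef CONFIG_PREEMPT
		rp->maxactive = max_t(unsigned int, 10, 2 * num_possible_cpus());
	#else
		rp->maxactive = num_possible_cpus();
	#endif
	}

So "0" really means "pick a sane per-CPU based default", not "none".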

-- Steve


>  
>   FETCHARGS   : Arguments. Each probe can have up to 128 args.
>%REG   : Fetch register REG


Re: [PATCH v1] tracing/kprobes: expose maxactive for kretprobe in kprobe_events

2017-03-28 Thread Steven Rostedt
On Wed, 29 Mar 2017 00:23:35 +0900
Masami Hiramatsu  wrote:

> > @@ -598,8 +601,10 @@ static int create_trace_kprobe(int argc, char **argv)
> >  {
> > /*
> >  * Argument syntax:
> > -*  - Add kprobe: p[:[GRP/]EVENT] [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS]
> > -*  - Add kretprobe: r[:[GRP/]EVENT] [MOD:]KSYM[+0] [FETCHARGS]
> > +*  - Add kprobe:
> > +*  p[:[GRP/]EVENT] [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS]
> > +*  - Add kretprobe:
> > +*  r[MAXACTIVE][:[GRP/]EVENT] [MOD:]KSYM[+0] [FETCHARGS]
> >  * Fetch args:
> >  *  $retval : fetch return value
> >  *  $stack  : fetch stack address
> > @@ -619,6 +624,7 @@ static int create_trace_kprobe(int argc, char **argv)
> > int i, ret = 0;
> > bool is_return = false, is_delete = false;
> > char *symbol = NULL, *event = NULL, *group = NULL;
> > +   int maxactive = 0;
> > char *arg;
> > unsigned long offset = 0;
> > void *addr = NULL;
> > @@ -637,8 +643,26 @@ static int create_trace_kprobe(int argc, char **argv)
> > return -EINVAL;
> > }
> >  
> > -   if (argv[0][1] == ':') {
> +   if (is_return && isdigit(argv[0][1]) && strchr(&argv[0][1], ':')) {
> 
> This only supports r[MAXACTIVE:[GRP/]EVENT]. e.g "r100" without event name.

You mean it doesn't support adding MAXACTIVE without the ':event'.

> 
> Thank you,
> 
> > +   event = strchr(&argv[0][1], ':') + 1;
> > +   event[-1] = '\0';
> > +   ret = kstrtouint(&argv[0][1], 0, &maxactive);
> > +   if (ret) {
> > +   pr_info("Failed to parse maxactive.\n");
> > +   return ret;
> > +   }
> > +   /* kretprobes instances are iterated over via a list. The
> > +* maximum should stay reasonable.
> > +*/
> > +   if (maxactive > 1024) {

Also, can we get rid of magic numbers within the code. There should be a
const or macro defined as MAX_MAXACTIVE or something, and use that here.
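
I.e. something like this (the name and value are illustrative):

	#define KRETPROBE_MAXACTIVE_MAX	4096

		if (maxactive > KRETPROBE_MAXACTIVE_MAX) {
			pr_info("Maxactive is too big.\n");
			return -E2BIG;
		}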

-- Steve


> > +   pr_info("Maxactive is too big.\n");
> > +   return -EINVAL;
> > +   }
> > +   } else if (argv[0][1] == ':') {
> > event = [0][2];
> > +   }
> > +
> > +   if (event) {
> > if (strchr(event, '/')) {
> > group = event;
> > event = strchr(group, '/') + 1;
> > @@ -718,8 +742,8 @@ static int create_trace_kprobe(int argc, char **argv)
> >  is_return ? 'r' : 'p', addr);
> > event = buf;
> > }
> > -   tk = alloc_trace_kprobe(group, event, addr, symbol, offset, argc,
> > -  is_return);
> > +   tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive,
> > +  argc, is_return);
> > if (IS_ERR(tk)) {
> > pr_info("Failed to allocate trace_probe.(%d)\n",
> > (int)PTR_ERR(tk));
> > -- 
> > 2.7.4
> >   
> 
> 



Re: [PATCH v1] tracing/kprobes: expose maxactive for kretprobe in kprobe_events

2017-03-28 Thread Steven Rostedt

"[PATCH v1]" I like your confidence, or lack thereof, that there isn't
going to be a v2 or v3 ;-)


Masami, what do you think of this?

-- Steve

On Tue, 28 Mar 2017 15:52:22 +0200
Alban Crequy  wrote:

> When a kretprobe is installed on a kernel function, there is a maximum
> limit of how many calls in parallel it can catch (aka "maxactive"). A
> kernel module could call register_kretprobe() and initialize maxactive
> (see example in samples/kprobes/kretprobe_example.c).
> 
> But that is not exposed to userspace and it is currently not possible to
> choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events
> 
> The default maxactive can be as low as 1 on single-core with a
> non-preemptive kernel. This is too low and we need to increase it not
> only for recursive functions, but for functions that sleep or resched.
> 
> This patch updates the format of the command that can be written to
> kprobe_events so that maxactive can be optionally specified.
> 
> I need this for a bpf program attached to the kretprobe of
> inet_csk_accept, which can sleep for a long time.
> 
> BugLink: https://github.com/iovisor/bcc/issues/1072
> Signed-off-by: Alban Crequy 
> ---
>  Documentation/trace/kprobetrace.txt |  4 +++-
>  kernel/trace/trace_kprobe.c | 34 +-
>  2 files changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/trace/kprobetrace.txt 
> b/Documentation/trace/kprobetrace.txt
> index 41ef9d8..655ca7e 100644
> --- a/Documentation/trace/kprobetrace.txt
> +++ b/Documentation/trace/kprobetrace.txt
> @@ -23,7 +23,7 @@ current_tracer. Instead of that, add probe points via
>  Synopsis of kprobe_events
>  -
>p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]   : Set a probe
> -  r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]  : Set a return probe
> +  r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]   : Set a return probe
>-:[GRP/]EVENT  : Clear a probe
>  
>   GRP : Group name. If omitted, use "kprobes" for it.
> @@ -32,6 +32,8 @@ Synopsis of kprobe_events
>   MOD : Module name which has given SYM.
>   SYM[+offs]  : Symbol+offset where the probe is inserted.
>   MEMADDR : Address where the probe is inserted.
> + MAXACTIVE   : Maximum number of instances of the specified function that
> +   can be probed simultaneously, or 0 for the default.(*)
>  
>   FETCHARGS   : Arguments. Each probe can have up to 128 args.
>%REG   : Fetch register REG
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 5f688cc..807e01c 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -282,6 +282,7 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group,
>void *addr,
>const char *symbol,
>unsigned long offs,
> +  int maxactive,
>int nargs, bool is_return)
>  {
>   struct trace_kprobe *tk;
> @@ -309,6 +310,8 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group,
>   else
>   tk->rp.kp.pre_handler = kprobe_dispatcher;
>  
> + tk->rp.maxactive = maxactive;
> +
>   if (!event || !is_good_name(event)) {
>   ret = -EINVAL;
>   goto error;
> @@ -598,8 +601,10 @@ static int create_trace_kprobe(int argc, char **argv)
>  {
>   /*
>* Argument syntax:
> -  *  - Add kprobe: p[:[GRP/]EVENT] [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS]
> -  *  - Add kretprobe: r[:[GRP/]EVENT] [MOD:]KSYM[+0] [FETCHARGS]
> +  *  - Add kprobe:
> +  *  p[:[GRP/]EVENT] [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS]
> +  *  - Add kretprobe:
> +  *  r[MAXACTIVE][:[GRP/]EVENT] [MOD:]KSYM[+0] [FETCHARGS]
>* Fetch args:
>*  $retval : fetch return value
>*  $stack  : fetch stack address
> @@ -619,6 +624,7 @@ static int create_trace_kprobe(int argc, char **argv)
>   int i, ret = 0;
>   bool is_return = false, is_delete = false;
>   char *symbol = NULL, *event = NULL, *group = NULL;
> + int maxactive = 0;
>   char *arg;
>   unsigned long offset = 0;
>   void *addr = NULL;
> @@ -637,8 +643,26 @@ static int create_trace_kprobe(int argc, char **argv)
>   return -EINVAL;
>   }
>  
> - if (argv[0][1] == ':') {
> + if (is_return && isdigit(argv[0][1]) && strchr(&argv[0][1], ':')) {
> + event = strchr(&argv[0][1], ':') + 1;
> + event[-1] = '\0';
> + ret = kstrtouint(&argv[0][1], 0, &maxactive);
> + if (ret) {
> + pr_info("Failed to parse maxactive.\n");
> + return ret;
> + }
> + /* kretprobes 

[PATCH] netfilter: Force fake conntrack entry to be at least 8 bytes aligned

2017-03-10 Thread Steven Rostedt (VMware)
Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.
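
For background, the packing this relies on looks roughly like the
following (paraphrased from the conntrack helpers of that era): the low
three bits of skb->_nfct carry the ctinfo, so the nf_conn pointer must
have those bits clear, which only holds if the structure is at least
8-byte aligned:

	#define NFCT_INFOMASK	7UL
	#define NFCT_PTRMASK	~(NFCT_INFOMASK)

	static inline struct nf_conn *
	nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
	{
		*ctinfo = skb->_nfct & NFCT_INFOMASK;
		return (struct nf_conn *)(skb->_nfct & NFCT_PTRMASK);
	}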

I triggered this on a 32bit machine with the following error:

BUG: unable to handle kernel NULL pointer dereference at 0af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 31962001 *pde =  

Oops:  [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX:  ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 80050033 CR2: 0af4 CR3: 309a0440 CR4: 001406f0
Call Trace:
 
 ? ipv6_skip_exthdr+0xac/0xcb
 ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
 nf_hook_slow+0x22/0xc7
 nf_hook+0x9a/0xad [ipv6]
 ? ip6t_do_table+0x356/0x379 [ip6_tables]
 ? ip6_fragment+0x9e9/0x9e9 [ipv6]
 ip6_output+0xee/0x107 [ipv6]
 ? ip6_fragment+0x9e9/0x9e9 [ipv6]
 dst_output+0x36/0x4d [ipv6]
 NF_HOOK.constprop.37+0xb2/0xba [ipv6]
 ? icmp6_dst_alloc+0x2c/0xfd [ipv6]
 ? local_bh_enable+0x14/0x14 [ipv6]
 mld_sendpack+0x1c5/0x281 [ipv6]
 ? mark_held_locks+0x40/0x5c
 mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
 call_timer_fn+0x135/0x283
 ? detach_if_pending+0x55/0x55
 ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
 __run_timers+0x111/0x14b
 ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
 run_timer_softirq+0x1c/0x36
 __do_softirq+0x185/0x37c
 ? test_ti_thread_flag.constprop.19+0xd/0xd
 do_softirq_own_stack+0x22/0x28
 
 irq_exit+0x5a/0xa4
 smp_apic_timer_interrupt+0x2a/0x34
 apic_timer_interrupt+0x37/0x3c

By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8-byte
alignment, as all cache line sizes are at least 8 bytes.

Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>
---
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index f540f9a..1960587 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -244,7 +244,7 @@ extern s32 (*nf_ct_nat_offset)(const struct nf_conn *ct,
   u32 seq);
 
 /* Fake conntrack entry for untracked connections */
-DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+DECLARE_PER_CPU_ALIGNED(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
return raw_cpu_ptr(&nf_conntrack_untracked);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 071b97f..ffb78e5 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -181,7 +181,11 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 seqcount_t nf_conntrack_generation __read_mostly;
 
-DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+/* nf_conn must be 8 bytes aligned, as the 3 LSB bits are used
+ * for the nfctinfo. We cheat by (ab)using the PER CPU cache line
+ * alignment to enforce this.
+ */
+DEFINE_PER_CPU_ALIGNED(struct nf_conn, nf_conntrack_untracked);
 EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static unsigned int nf_conntrack_hash_rnd __read_mostly;


Re: [PATCH net-next 1/3] trace: add variant without spacing in trace_print_hex_seq

2017-01-30 Thread Steven Rostedt
On Wed, 25 Jan 2017 02:28:16 +0100
Daniel Borkmann <dan...@iogearbox.net> wrote:

> For upcoming tracepoint support for BPF, we want to dump the program's
> tag. Format should be similar to __print_hex(), but without spacing.
> Add a __print_hex_str() variant for exactly that purpose that reuses
> trace_print_hex_seq().
> 
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Steven Rostedt <rost...@goodmis.org>
> Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
> ---
>  include/linux/trace_events.h | 3 ++-
>  include/trace/trace_events.h | 8 +++-
>  kernel/trace/trace_output.c  | 7 ---
>  3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index be00761..cfa475a 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -33,7 +33,8 @@ const char *trace_print_bitmask_seq(struct trace_seq *p, 
> void *bitmask_ptr,
>   unsigned int bitmask_size);
>  
>  const char *trace_print_hex_seq(struct trace_seq *p,
> - const unsigned char *buf, int len);
> + const unsigned char *buf, int len,
> + bool spacing);

Hmm, "spacing" doesn't really mean much. What about inverting it and
calling it "concatenate"?

>  
>  const char *trace_print_array_seq(struct trace_seq *p,
>  const void *buf, int count,
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 467e12f..9f68462 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -297,7 +297,12 @@
>  #endif
>  
>  #undef __print_hex
> -#define __print_hex(buf, buf_len) trace_print_hex_seq(p, buf, buf_len)
> +#define __print_hex(buf, buf_len) \
> + trace_print_hex_seq(p, buf, buf_len, true)
> +
> +#undef __print_hex_str
> +#define __print_hex_str(buf, buf_len) \
> + trace_print_hex_seq(p, buf, buf_len, false)
>  
>  #undef __print_array
>  #define __print_array(array, count, el_size) \
> @@ -711,6 +716,7 @@
>  #undef __print_flags
>  #undef __print_symbolic
>  #undef __print_hex
> +#undef __print_hex_str
>  #undef __get_dynamic_array
>  #undef __get_dynamic_array_len
>  #undef __get_str
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 5d33a73..30a144b1 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -163,14 +163,15 @@ enum print_line_t trace_print_printk_msg_only(struct trace_iterator *iter)
>  EXPORT_SYMBOL_GPL(trace_print_bitmask_seq);
>  

With the addition of this boolean parameter, this function should
probably have a kernel-doc header that explains the parameters.
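
E.g. something like this (a sketch; the final wording is the author's
call):

	/**
	 * trace_print_hex_seq - print a buffer as a hex sequence
	 * @p: trace seq struct to write to
	 * @buf: the buffer to print
	 * @buf_len: buffer length
	 * @concatenate: print the buffer without spacing between the bytes
	 */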

-- Steve

>  const char *
> -trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
> +trace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len,
> + bool spacing)
>  {
>   int i;
>   const char *ret = trace_seq_buffer_ptr(p);
>  
>   for (i = 0; i < buf_len; i++)
> - trace_seq_printf(p, "%s%2.2x", i == 0 ? "" : " ", buf[i]);
> -
> + trace_seq_printf(p, "%s%2.2x", !spacing || i == 0 ? "" : " ",
> +  buf[i]);
>   trace_seq_putc(p, 0);
>  
>   return ret;



Re: [PATCH net-next 2/3] lib, traceevent: add PRINT_HEX_STR variant

2017-01-30 Thread Steven Rostedt
On Wed, 25 Jan 2017 02:28:17 +0100
Daniel Borkmann <dan...@iogearbox.net> wrote:

> Add support for the __print_hex_str() macro that was added for
> tracing, so that user space tools such as perf can understand
> it as well.
> 
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Steven Rostedt <rost...@goodmis.org>
> Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
> ---

I haven't tested it, but it looks fine with me:

Acked-by: Steven Rostedt (VMware) <rost...@goodmis.org>

-- Steve


Re: [PATCH v3] net/phy: add trace events for mdio accesses

2016-11-22 Thread Steven Rostedt
On Tue, 22 Nov 2016 16:47:11 +0100
Uwe Kleine-König <u...@kleine-koenig.org> wrote:

> Make it possible to generate trace events for mdio read and write accesses.
> 
> Signed-off-by: Uwe Kleine-König <u...@kleine-koenig.org>

For the tracing side.

Acked-by: Steven Rostedt <rost...@goodmis.org>

-- Steve


Re: [PATCH v2] net/phy: add trace events for mdio accesses

2016-11-22 Thread Steven Rostedt
On Tue, 22 Nov 2016 11:01:27 +0100
Uwe Kleine-König  wrote:

> diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
> new file mode 100644
> index ..468e2d095d19
> --- /dev/null
> +++ b/include/trace/events/mdio.h
> @@ -0,0 +1,42 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mdio
> +
> +#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MDIO_H
> +
> +#include 
> +
> +TRACE_EVENT_CONDITION(mdio_access,
> +
> + TP_PROTO(struct mii_bus *bus, int read,
> +  unsigned addr, unsigned regnum, u16 val, int err),
> +
> + TP_ARGS(bus, read, addr, regnum, val, err),
> +
> + TP_CONDITION(err >= 0),
> +
> + TP_STRUCT__entry(
> + __array(char, busid, MII_BUS_ID_SIZE)
> + __field(int, read)

read is just a 0 or 1. What about making it a char? That way we can
pack this better. If I'm not mistaken, MII_BUS_ID_SIZE is (20 - 3) or
17. If read is just one byte, then it can fit in one of those three
bytes, and you save 4 extra bytes (assuming addr will be 4 byte
aligned).
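
I.e. a sketch of the suggested packing (a char is plenty for a 0/1 flag):

	TP_STRUCT__entry(
		__array(char, busid, MII_BUS_ID_SIZE)
		__field(char, read)	/* fits in the padding after busid */
		__field(unsigned, addr)
		__field(unsigned, regnum)
		__field(u16, val)
	),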

-- Steve


> + __field(unsigned, addr)
> + __field(unsigned, regnum)
> + __field(u16, val)
> + ),
> +
> + TP_fast_assign(
> + strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
> + __entry->read = read;
> + __entry->addr = addr;
> + __entry->regnum = regnum;
> + __entry->val = val;
> + ),
> +
> + TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
> +   __entry->busid, __entry->read ? "read" : "write",
> +   __entry->addr, __entry->regnum, __entry->val)
> +);
> +
> +#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
> +
> +/* This part must be outside protection */
> +#include 



Re: [PATCH] net/phy: add trace events for mdio accesses

2016-11-14 Thread Steven Rostedt
On Mon, 14 Nov 2016 12:03:35 +0100
Uwe Kleine-König  wrote:

> Make it possible to generate trace events for mdio read and write accesses.
> 
> Signed-off-by: Uwe Kleine-König 
> ---
>  drivers/net/phy/mdio_bus.c  | 15 +++
>  include/trace/events/mdio.h | 40 
>  2 files changed, 55 insertions(+)
>  create mode 100644 include/trace/events/mdio.h
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 09deef4bed09..0f3f207419f6 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -38,6 +38,9 @@
>  
>  #include 
>  
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  int mdiobus_register_device(struct mdio_device *mdiodev)
>  {
>   if (mdiodev->bus->mdio_map[mdiodev->addr])
> @@ -461,6 +464,9 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum)
>   retval = bus->read(bus, addr, regnum);
>   mutex_unlock(&bus->mdio_lock);
>  
> + if (retval >= 0)
> + trace_mdio_access(bus, 1, addr, regnum, retval);

These cause branches to be taken when tracing is disabled, breaking the
"zero overhead when disabled" tracing guideline. As retval is passed to
the tracepoint, please look at TRACE_EVENT_CONDITION() and use that.
It will move the if statement into the enabling of the trace event and
keep the overhead to a minimum when the tracepoint is disabled.

Do the same for the below as well.
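
I.e. roughly this, which is what v2 of the patch (earlier in this thread)
ended up doing; the extra err argument carries the status that
TP_CONDITION() checks:

	TRACE_EVENT_CONDITION(mdio_access,

		TP_PROTO(struct mii_bus *bus, int read,
			 unsigned addr, unsigned regnum, u16 val, int err),

		TP_ARGS(bus, read, addr, regnum, val, err),

		/* evaluated only when the tracepoint itself is enabled */
		TP_CONDITION(err >= 0),
		...
	);

and on the caller side the open-coded branch goes away (sketch):

	retval = bus->read(bus, addr, regnum);
	trace_mdio_access(bus, 1, addr, regnum, retval, retval);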

-- Steve

> +
>   return retval;
>  }
>  EXPORT_SYMBOL(mdiobus_read_nested);
> @@ -485,6 +491,9 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
>   retval = bus->read(bus, addr, regnum);
>   mutex_unlock(&bus->mdio_lock);
>  
> + if (retval >= 0)
> + trace_mdio_access(bus, 1, addr, regnum, retval);
> +
>   return retval;
>  }
>  EXPORT_SYMBOL(mdiobus_read);
> @@ -513,6 +522,9 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
>   err = bus->write(bus, addr, regnum, val);
>   mutex_unlock(&bus->mdio_lock);
>  
> + if (!err)
> + trace_mdio_access(bus, 0, addr, regnum, val);
> +
>   return err;
>  }
>  EXPORT_SYMBOL(mdiobus_write_nested);
> @@ -538,6 +550,9 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
>   err = bus->write(bus, addr, regnum, val);
>   mutex_unlock(&bus->mdio_lock);
>  
> + if (!err)
> + trace_mdio_access(bus, 0, addr, regnum, val);
> +
>   return err;
>  }
>  EXPORT_SYMBOL(mdiobus_write);
> diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
> new file mode 100644
> index ..dcb2d456a346
> --- /dev/null
> +++ b/include/trace/events/mdio.h
> @@ -0,0 +1,40 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mdio
> +
> +#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MDIO_H
> +
> +#include 
> +
> +TRACE_EVENT(mdio_access,
> +
> + TP_PROTO(struct mii_bus *bus, int read,
> +  unsigned addr, unsigned regnum, u16 val),
> +
> + TP_ARGS(bus, read, addr, regnum, val),
> +
> + TP_STRUCT__entry(
> + __array(char, busid, MII_BUS_ID_SIZE)
> + __field(int, read)
> + __field(unsigned, addr)
> + __field(unsigned, regnum)
> + __field(u16, val)
> + ),
> +
> + TP_fast_assign(
> + strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
> + __entry->read = read;
> + __entry->addr = addr;
> + __entry->regnum = regnum;
> + __entry->val = val;
> + ),
> +
> + TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
> +   __entry->busid, __entry->read ? "read" : "write",
> +   __entry->addr, __entry->regnum, __entry->val)
> +);
> +
> +#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
> +
> +/* This part must be outside protection */
> +#include 



Re: [PATCH 249/249] net:ethernet:intel:igb_main.c: Add Throttling disable option in order to decrease latency usually required by RT applications.

2016-09-06 Thread Steven Rostedt
On Tue, 6 Sep 2016 13:17:04 +
Amir Yihie  wrote:

> 

I guess there's not much to do here?

-- Steve


Re: [RT PATCH 2/2] net: add a lock around icmp_sk()

2016-08-31 Thread Steven Rostedt
On Wed, 31 Aug 2016 09:37:43 -0700
Eric Dumazet  wrote:

> On Wed, 2016-08-31 at 18:00 +0200, Sebastian Andrzej Siewior wrote:
> > It looks like the this_cpu_ptr() access in icmp_sk() is protected with
> > local_bh_disable(). To avoid missing serialization in -RT I am adding
> > here a local lock. No crash has been observed, this is just precaution.
> >   
> 
> 
> Hmm...
> 
> > Signed-off-by: Sebastian Andrzej Siewior 
> > ---
> >  net/ipv4/icmp.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > index c1f1d5030d37..63731fd6af3e 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -78,6 +78,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -205,6 +206,8 @@ static const struct icmp_control icmp_pointers[NR_ICMP_TYPES+1];
> >   *
> >   * On SMP we have one ICMP socket per-cpu.
> >   */
> > +static DEFINE_LOCAL_IRQ_LOCK(icmp_sk_lock);
> > +
> >  static struct sock *icmp_sk(struct net *net)
> >  {
> > return *this_cpu_ptr(net->ipv4.icmp_sk);
> > @@ -216,12 +219,14 @@ static inline struct sock *icmp_xmit_lock(struct net *net)
> >  
> > local_bh_disable();
> >  
> > +   local_lock(icmp_sk_lock);  
> 
> Deadlock alert ?
> Please read the comment few lines after, explaining why we have to use
> spin_trylock().
> Or maybe I should double check what is local_lock() in RT

And I don't know exactly what the deadlock scenario of the comment
below is. Is it racing with a softirq somehow?

Note, in RT softirqs can schedule out, and are preemptable. But I don't
know enough about this code to know if that is enough to not have a
deadlock here.

-- Steve


> 
> > sk = icmp_sk(net);
> >  
> > if (unlikely(!spin_trylock(&sk->sk_lock.slock))) {
> 
> ...
> > /* This can happen if the output path signals a
> >  * dst_link_failure() for an outgoing ICMP packet.
> >  */  
> 
> 
> > +   local_unlock(icmp_sk_lock);
> > local_bh_enable();
> > return NULL;  
> 



Re: [RT PATCH 1/2] net: add back the missing serialization in ip_send_unicast_reply()

2016-08-31 Thread Steven Rostedt
On Wed, 31 Aug 2016 18:00:48 +0200
Sebastian Andrzej Siewior  wrote:

> Some time ago Sami Pietikäinen reported a crash on -RT in
> ip_send_unicast_reply() which was later fixed by Nicholas Mc Guire
> (v3.12.8-rt11). Later (v3.18.8) the code was reworked and I dropped the
> patch. As it turns out, that was a mistake.
> I have reports that the same crash is possible with a similar backtrace.
> It seems that vanilla protects access to this_cpu_ptr() via
> local_bh_disable(). This does not work on -RT since we can have
> NET_RX and NET_TX running in parallel on the same CPU.
> This brings back the old locks.
> 
> |Unable to handle kernel NULL pointer dereference at virtual address 0010
> |PC is at __ip_make_skb+0x198/0x3e8
> |[] (__ip_make_skb) from [] (ip_push_pending_frames+0x20/0x40)
> |[] (ip_push_pending_frames) from [] (ip_send_unicast_reply+0x210/0x22c)
> |[] (ip_send_unicast_reply) from [] (tcp_v4_send_reset+0x190/0x1c0)
> |[] (tcp_v4_send_reset) from [] (tcp_v4_do_rcv+0x22c/0x288)
> |[] (tcp_v4_do_rcv) from [] (release_sock+0xb4/0x150)
> |[] (release_sock) from [] (tcp_close+0x240/0x454)
> |[] (tcp_close) from [] (inet_release+0x74/0x7c)
> |[] (inet_release) from [] (sock_release+0x30/0xb0)
> |[] (sock_release) from [] (sock_close+0x1c/0x24)
> |[] (sock_close) from [] (__fput+0xe8/0x20c)
> |[] (__fput) from [] (fput+0x18/0x1c)
> |[] (fput) from [] (task_work_run+0xa4/0xb8)
> |[] (task_work_run) from [] (do_work_pending+0xd0/0xe4)
> |[] (do_work_pending) from [] (work_pending+0xc/0x20)
> |Code: e3530001 8a01 e3a00040 ea11 (e5973010)
> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  net/ipv4/tcp_ipv4.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ad450509029b..c5521d1f1263 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -62,6 +62,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/locallock.h>
>  
>  #include 
>  #include 
> @@ -565,6 +566,7 @@ void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(tcp_v4_send_check);
>  
> +static DEFINE_LOCAL_IRQ_LOCK(tcp_sk_lock);
>  /*
>   *   This routine will send an RST to the other tcp.
>   *
> @@ -689,10 +691,13 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
> 		     offsetof(struct inet_timewait_sock, tw_bound_dev_if));
>  
>   arg.tos = ip_hdr(skb)->tos;
> +
> + local_lock(tcp_sk_lock);

Interesting: I notice that in mainline, they have:

local_bh_disable();

here.

I'm surprised we don't have a local_lock_bh() or something to that
effect.
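
Something like this, perhaps (a hypothetical sketch only; no such
primitive exists at this point, and on non-RT it could collapse to
plain local_bh_disable()/local_bh_enable()):

#define local_lock_bh(lvar)			\
	do {					\
		local_bh_disable();		\
		local_lock(lvar);		\
	} while (0)

#define local_unlock_bh(lvar)			\
	do {					\
		local_unlock(lvar);		\
		local_bh_enable();		\
	} while (0)

That would give callers like tcp_v4_send_reset() the mainline
local_bh_disable() semantics plus the -RT serialization in a single
annotation.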

-- Steve

> 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
> 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
> 			      &arg, arg.iov[0].iov_len);
> + local_unlock(tcp_sk_lock);
>  
>   TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
>   TCP_INC_STATS_BH(net, TCP_MIB_OUTRSTS);
> @@ -774,10 +779,12 @@ static void tcp_v4_send_ack(struct net *net,
>   if (oif)
>   arg.bound_dev_if = oif;
>   arg.tos = tos;
> + local_lock(tcp_sk_lock);
> 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
> 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
> 			      &arg, arg.iov[0].iov_len);
> + local_unlock(tcp_sk_lock);
>  
>   TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
>  }



Re: [4.4-RT PATCH RFC/RFT] drivers: net: cpsw: mark rx/tx irq as IRQF_NO_THREAD

2016-08-11 Thread Steven Rostedt
On Thu, 11 Aug 2016 19:15:40 +0300
Grygorii Strashko  wrote:

> Mark the CPSW Rx/Tx IRQs as IRQF_NO_THREAD to avoid double scheduling
> on -RT, where these IRQs are forced threaded:
>  rx-irq
>   |- schedule threaded rx-irq handler
> ...
>   |- threaded rx-irq handler -> cpsw_rx_interrupt()
>      |- napi_schedule()
>         |- __raise_softirq_irqoff()
>            |- wakeup_proper_softirq()
> ...
>   napi
> 
> after:
>  rx-irq
>   |- cpsw_rx_interrupt()
>      |- napi_schedule()
>   |- irq_exit()
>      |- invoke_softirq()
>         |- wakeup_softirqd()
> ...
>   napi
> 
> And, as a result, we get the following improvements (tested on
> am57xx-evm):
> 
> 1) The "[ 78.348599] NOHZ: local_softirq_pending 80" message will not
>    be seen any more. Currently these warnings appear as soon as iperf
>    is started:
>    # iperf -c $IPERFHOST -w 128K -d -t 60
> 
> 2) latency reduction when cyclictest is run in parallel with network
>    load, where net_perf.sh is:
>    iperf -c $IPERFHOST -w 8K   -d -t 60
>    iperf -c $IPERFHOST -w 16K  -d -t 60
>    iperf -c $IPERFHOST -w 32K  -d -t 60
>    iperf -c $IPERFHOST -w 64K  -d -t 60
>    iperf -c $IPERFHOST -w 128K -d -t 60
> 
> before:
> T: 0 ( 1326) P:98 I:1000 C: 24 Min: 8 Act: 13 Avg: 18 Max: 70
> T: 1 ( 1327) P:98 I:1500 C: 159981 Min: 9 Act: 15 Avg: 16 Max: 43
> after:
> T: 0 ( 1331) P:98 I:1000 C: 24 Min: 8 Act: 15 Avg: 14 Max: 51
> T: 1 ( 1332) P:98 I:1500 C: 159953 Min: 8 Act: 16 Avg: 15 Max: 33
> 
> 3) network performance increase
> 
> win, KMbits/s
>        before   after    %
> 8K     354      350.3    0.0
> 16K    412      551      33.7
> 32K    423      659.5    55.9
> 64K    436      728.3    67.0
> 128K   537      845      57.4
> 
> This change does not affect non-RT.

This looks fine to me, but it should go into the development branch,
which is currently 4.6-rt. And I can then pull it from there.

-- Steve


> 
> Signed-off-by: Grygorii Strashko 
> ---
> Hi All,
> 
> I'd appreciate any feedback or Tested-by tags.
> In case of positive feedback I'll resend it for upstream.
> 
>  drivers/net/ethernet/ti/cpsw.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
> index 7b59283..fa4bb81 100644
> --- a/drivers/net/ethernet/ti/cpsw.c
> +++ b/drivers/net/ethernet/ti/cpsw.c
> @@ -769,7 +769,7 @@ static irqreturn_t cpsw_tx_interrupt(int irq, void *dev_id)
>   priv->tx_irq_disabled = true;
>   }
>  
> -	napi_schedule(&priv->napi_tx);
> +	napi_schedule_irqoff(&priv->napi_tx);
>   return IRQ_HANDLED;
>  }
>  
> @@ -785,7 +785,7 @@ static irqreturn_t cpsw_rx_interrupt(int irq, void *dev_id)
>   priv->rx_irq_disabled = true;
>   }
>  
> -	napi_schedule(&priv->napi_rx);
> +	napi_schedule_irqoff(&priv->napi_rx);
>   return IRQ_HANDLED;
>  }
>  
> @@ -2827,7 +2827,7 @@ static int cpsw_probe(struct platform_device *pdev)
>  
>   priv->irqs_table[0] = irq;
> 	ret = devm_request_irq(&pdev->dev, irq, cpsw_rx_interrupt,
> -			       0, dev_name(&pdev->dev), priv);
> +			       IRQF_NO_THREAD, dev_name(&pdev->dev), priv);
>   if (ret < 0) {
>   dev_err(priv->dev, "error attaching irq (%d)\n", ret);
>   goto clean_ale_ret;
> @@ -2842,7 +2842,7 @@ static int cpsw_probe(struct platform_device *pdev)
>  
>   priv->irqs_table[1] = irq;
> 	ret = devm_request_irq(&pdev->dev, irq, cpsw_tx_interrupt,
> -			       0, dev_name(&pdev->dev), priv);
> +			       IRQF_NO_THREAD, dev_name(&pdev->dev), priv);
>   if (ret < 0) {
>   dev_err(priv->dev, "error attaching irq (%d)\n", ret);
>   goto clean_ale_ret;
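
For reference, here is the shape of that conversion in a
self-contained form (names are hypothetical, modeled on the cpsw
hunks above; a sketch, not the actual driver):

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/platform_device.h>

/* Hypothetical driver context, for illustration only. */
struct demo_priv {
	struct napi_struct napi_rx;
};

static irqreturn_t demo_rx_interrupt(int irq, void *dev_id)
{
	struct demo_priv *priv = dev_id;

	/* With IRQF_NO_THREAD this runs in hard-irq context even on
	 * -RT, so interrupts are off here: use the _irqoff variant
	 * and do nothing beyond kicking NAPI. */
	napi_schedule_irqoff(&priv->napi_rx);
	return IRQ_HANDLED;
}

static int demo_request_rx_irq(struct platform_device *pdev,
			       struct demo_priv *priv, int irq)
{
	/* IRQF_NO_THREAD opts the handler out of forced threading,
	 * which is what removes the irq-thread -> softirq double
	 * wakeup described in the changelog. */
	return devm_request_irq(&pdev->dev, irq, demo_rx_interrupt,
				IRQF_NO_THREAD, dev_name(&pdev->dev), priv);
}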



Re: [PATCH 1084/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Steven Rostedt
On Tue,  2 Aug 2016 20:15:02 +0800
Baole Ni  wrote:

> I find that developers often specify a numeric value when calling a
> macro that takes an access-permission parameter. These numeric
> permission values have corresponding macros, and using the macros can
> improve the robustness and readability of the code, so I suggest
> replacing the numeric parameter with the macro.
> 

NACK!

I find 0444 more readable than S_IRUSR | S_IRGRP | S_IROTH.
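
For what it's worth, the two spellings are identical as far as the
compiler is concerned; the question is purely which one a reader
parses faster (demo_time below is a made-up parameter):

#include <linux/moduleparam.h>
#include <linux/stat.h>

static unsigned int demo_time = 200;
/* 0644 == S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH */
module_param(demo_time, uint, 0644);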

-- Steve

> Signed-off-by: Chuansheng Liu 
> Signed-off-by: Baole Ni 
> ---
>  net/802/mrp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/802/mrp.c b/net/802/mrp.c
> index 72db278..4ff215f 100644
> --- a/net/802/mrp.c
> +++ b/net/802/mrp.c
> @@ -22,11 +22,11 @@
>  #include 
>  
>  static unsigned int mrp_join_time __read_mostly = 200;
> -module_param(mrp_join_time, uint, 0644);
> +module_param(mrp_join_time, uint, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
>  MODULE_PARM_DESC(mrp_join_time, "Join time in ms (default 200ms)");
>  
>  static unsigned int mrp_periodic_time __read_mostly = 1000;
> -module_param(mrp_periodic_time, uint, 0644);
> +module_param(mrp_periodic_time, uint, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
>  MODULE_PARM_DESC(mrp_periodic_time, "Periodic time in ms (default 1s)");
>  
>  MODULE_LICENSE("GPL");



Re: [PATCH 1083/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Steven Rostedt
On Tue,  2 Aug 2016 20:14:55 +0800
Baole Ni  wrote:

> I find that developers often specify a numeric value when calling a
> macro that takes an access-permission parameter. These numeric
> permission values have corresponding macros, and using the macros can
> improve the robustness and readability of the code, so I suggest
> replacing the numeric parameter with the macro.
> 

NACK!

I find 0444 more readable than S_IRUSR | S_IRGRP | S_IROTH.

-- Steve

> Signed-off-by: Chuansheng Liu 
> Signed-off-by: Baole Ni 
> ---
>  net/802/garp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/802/garp.c b/net/802/garp.c
> index b38ee6d..ecc976a 100644
> --- a/net/802/garp.c
> +++ b/net/802/garp.c
> @@ -22,7 +22,7 @@
>  #include 
>  
>  static unsigned int garp_join_time __read_mostly = 200;
> -module_param(garp_join_time, uint, 0644);
> +module_param(garp_join_time, uint, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
>  MODULE_PARM_DESC(garp_join_time, "Join time in ms (default 200ms)");
>  MODULE_LICENSE("GPL");
>  



Re: [patch net-next 2/2] devlink: fix trace format string

2016-07-14 Thread Steven Rostedt
On Thu, 14 Jul 2016 10:07:38 -0700
Randy Dunlap  wrote:

> On 07/14/16 02:37, Jiri Pirko wrote:
> > From: Arnd Bergmann 
> > 
> > Including devlink.h on ARM and probably other 32-bit architectures
> > results in a harmless warning:
> > 
> > In file included from ../include/trace/define_trace.h:95:0,
> >  from ../include/trace/events/devlink.h:51,
> >  from ../net/core/devlink.c:30:
> > include/trace/events/devlink.h: In function 'trace_raw_output_devlink_hwmsg':
> > include/trace/events/devlink.h:42:12: error: format '%lu' expects argument of type 'long unsigned int', but argument 10 has type 'size_t {aka unsigned int}' [-Werror=format=]
> > 
> > The correct format string for 'size_t' is %zu, not %lu; %zu works on
> > all architectures.
> > 
> > Signed-off-by: Arnd Bergmann 
> > Fixes: e5224f0fe2ac ("devlink: add hardware messages tracing facility")
> > Signed-off-by: Jiri Pirko   
> 
> Acked-by: Randy Dunlap 
> 

ditto!

-- Steve
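
As an aside, the %zu rule is easy to demonstrate in plain userspace C
as well (a standalone sketch, not part of the patch):

#include <stdio.h>
#include <stddef.h>

int main(void)
{
	size_t len = sizeof(long);

	/* Correct everywhere: %zu matches size_t by definition. */
	printf("len = %zu\n", len);

	/*
	 * Broken on 32-bit targets, where size_t is typically
	 * unsigned int, which is exactly the -Wformat error the
	 * patch fixes:
	 *
	 *	printf("len = %lu\n", len);
	 */
	return 0;
}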


  1   2   >