Re: [PATCH v2 1/3] tracing: Dereference pointers without RCU checks

2016-08-28 Thread Masami Hiramatsu
Hi Binoy,

On Fri, 26 Aug 2016 10:56:12 +0530
Binoy Jayan  wrote:

> On 26 August 2016 at 07:19, Masami Hiramatsu  wrote:
> > On Wed, 24 Aug 2016 16:47:28 +0530
> >> "__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
> >> tells lockdep softirqs are enabled with trace_softirqs_on() after that
> >> we go and actually modify the preempt_count with preempt_count_sub().
> >> Then in preempt_count_sub() you call into trace_preempt_on() if this
> >> was the last preempt_count increment but you do that _before_ you
> >> actually change the preempt_count with __preempt_count_sub() at this
> >> point lockdep and preempt_count think the world differs and *boom*"
> >>
> >> So the simplest way to avoid this is by disabling the consistency
> >> checks.
> >>
> >> We also need to take care of the iteration in trace_events_trigger.c
> >> to avoid a splat in conjunction with the hist trigger.
> >
> > Special care for lockdep inside the tracepoint handler is reasonable.
> >
> > Reviewed-by: Masami Hiramatsu 
> >
> > Steven, since this seems a bugfix, could you pick this from the series?
> >
> > Thank you,
> >
> > --
> > Masami Hiramatsu 
> 
> 
> Hi Daniel/Masami,
> 
> I ran into a similar RCU error while using the same tracepoint for all
> three latency types
> and using a filter like below to trigger only events falling under a
> specific type.
> 
> echo 'hist:key=ltype,cpu:val=latency:sort=ltype,cpu if ltype==0' > \
>/sys/kernel/debug/tracing/events/latency/latency_preempt/trigger
> 
> The error occurs only when I use the predicate 'if ltype==0' as filter.
> 
> It occurs in 'filter_match_preds' during a call to 'rcu_dereference_sched'.
> kernel/trace/trace_events_filter.c +611 : filter_match_preds()
> 
> Surprisingly, this happens only the first time the echo command is used on
> the trigger file after each boot.
> 
> Do you think it is similar to the bug you have fixed? Maybe I'll try using
> "rcu_dereference_raw_notrace" instead of 'rcu_dereference_sched'.

Yes, I think we should fix that too. No rcu_dereference_sched should be used
in trace callees (except for kprobe events, since those are out of context).

Thank you,


> 
>  Binoy
> 
> 
> [ 1029.324257] ===
> [ 1029.324785] [ INFO: suspicious RCU usage. ]
> [ 1029.328698] 4.7.0+ #49 Not tainted
> [ 1029.332858] ---
> [ 1029.336334] 
> /local/mnt/workspace/src/korg/linux/kernel/trace/trace_events_filter.c:611
> suspicious rcu_dereference_check() usage!
> [ 1029.340423]
> [ 1029.340423] other info that might help us debug this:
> [ 1029.340423]
> [ 1029.352226]
> [ 1029.352226] RCU used illegally from idle CPU!
> [ 1029.352226] rcu_scheduler_active = 1, debug_locks = 0
> [ 1029.359953] RCU used illegally from extended quiescent state!
> [ 1029.371057] no locks held by swapper/0/0.
> [ 1029.376696]
> [ 1029.376696] stack backtrace:
> [ 1029.380693] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.7.0+ #49
> [ 1029.385033] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [ 1029.39] Call trace:
> [ 1029.397798] [] dump_backtrace+0x0/0x1e0
> [ 1029.399967] [] show_stack+0x24/0x2c
> [ 1029.405523] [] dump_stack+0xb0/0xf0
> [ 1029.410557] [] lockdep_rcu_suspicious+0xe8/0x120
> [ 1029.415595] [] filter_match_preds+0x108/0x118
> [ 1029.421669] [] event_triggers_call+0x5c/0xc0
> [ 1029.427485] [] trace_event_buffer_commit+0x11c/0x244
> [ 1029.433390] []
> trace_event_raw_event_latency_template+0x58/0xa4
> [ 1029.439902] [] time_hardirqs_on+0x264/0x290
> [ 1029.447450] [] trace_hardirqs_on_caller+0x20/0x180
> [ 1029.453179] [] trace_hardirqs_on+0x10/0x18
> [ 1029.459604] [] cpuidle_enter_state+0xc8/0x2e0
> [ 1029.465246] [] cpuidle_enter+0x34/0x40
> [ 1029.470888] [] call_cpuidle+0x3c/0x5c
> [ 1029.476442] [] cpu_startup_entry+0x1c0/0x360
> [ 1029.481392] [] rest_init+0x150/0x160
> [ 1029.487293] [] start_kernel+0x3a4/0x3b8
> [ 1029.492415] [] __primary_switched+0x30/0x74
> 
> Binoy


-- 
Masami Hiramatsu 


Re: [PATCH v2 1/3] tracing: Dereference pointers without RCU checks

2016-08-25 Thread Binoy Jayan
On 26 August 2016 at 07:19, Masami Hiramatsu  wrote:
> On Wed, 24 Aug 2016 16:47:28 +0530
>> "__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
>> tells lockdep softirqs are enabled with trace_softirqs_on() after that
>> we go and actually modify the preempt_count with preempt_count_sub().
>> Then in preempt_count_sub() you call into trace_preempt_on() if this
>> was the last preempt_count increment but you do that _before_ you
>> actually change the preempt_count with __preempt_count_sub() at this
>> point lockdep and preempt_count think the world differs and *boom*"
>>
>> So the simplest way to avoid this is by disabling the consistency
>> checks.
>>
>> We also need to take care of the iteration in trace_events_trigger.c
>> to avoid a splat in conjunction with the hist trigger.
>
> Special care for lockdep inside the tracepoint handler is reasonable.
>
> Reviewed-by: Masami Hiramatsu 
>
> Steven, since this seems a bugfix, could you pick this from the series?
>
> Thank you,
>
> --
> Masami Hiramatsu 


Hi Daniel/Masami,

I ran into a similar RCU error while using the same tracepoint for all
three latency types
and using a filter like below to trigger only events falling under a
specific type.

echo 'hist:key=ltype,cpu:val=latency:sort=ltype,cpu if ltype==0' > \
   /sys/kernel/debug/tracing/events/latency/latency_preempt/trigger

The error occurs only when I use the predicate 'if ltype==0' as filter.

It occurs in 'filter_match_preds' during a call to 'rcu_dereference_sched'.
kernel/trace/trace_events_filter.c +611 : filter_match_preds()

Surprisingly, this happens only the first time the echo command is used on
the trigger file after each boot.

Do you think it is similar to the bug you have fixed? Maybe I'll try using
"rcu_dereference_raw_notrace" instead of 'rcu_dereference_sched'.

 Binoy


[ 1029.324257] ===
[ 1029.324785] [ INFO: suspicious RCU usage. ]
[ 1029.328698] 4.7.0+ #49 Not tainted
[ 1029.332858] ---
[ 1029.336334] 
/local/mnt/workspace/src/korg/linux/kernel/trace/trace_events_filter.c:611
suspicious rcu_dereference_check() usage!
[ 1029.340423]
[ 1029.340423] other info that might help us debug this:
[ 1029.340423]
[ 1029.352226]
[ 1029.352226] RCU used illegally from idle CPU!
[ 1029.352226] rcu_scheduler_active = 1, debug_locks = 0
[ 1029.359953] RCU used illegally from extended quiescent state!
[ 1029.371057] no locks held by swapper/0/0.
[ 1029.376696]
[ 1029.376696] stack backtrace:
[ 1029.380693] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.7.0+ #49
[ 1029.385033] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 1029.39] Call trace:
[ 1029.397798] [] dump_backtrace+0x0/0x1e0
[ 1029.399967] [] show_stack+0x24/0x2c
[ 1029.405523] [] dump_stack+0xb0/0xf0
[ 1029.410557] [] lockdep_rcu_suspicious+0xe8/0x120
[ 1029.415595] [] filter_match_preds+0x108/0x118
[ 1029.421669] [] event_triggers_call+0x5c/0xc0
[ 1029.427485] [] trace_event_buffer_commit+0x11c/0x244
[ 1029.433390] []
trace_event_raw_event_latency_template+0x58/0xa4
[ 1029.439902] [] time_hardirqs_on+0x264/0x290
[ 1029.447450] [] trace_hardirqs_on_caller+0x20/0x180
[ 1029.453179] [] trace_hardirqs_on+0x10/0x18
[ 1029.459604] [] cpuidle_enter_state+0xc8/0x2e0
[ 1029.465246] [] cpuidle_enter+0x34/0x40
[ 1029.470888] [] call_cpuidle+0x3c/0x5c
[ 1029.476442] [] cpu_startup_entry+0x1c0/0x360
[ 1029.481392] [] rest_init+0x150/0x160
[ 1029.487293] [] start_kernel+0x3a4/0x3b8
[ 1029.492415] [] __primary_switched+0x30/0x74

Binoy


Re: [PATCH v2 1/3] tracing: Dereference pointers without RCU checks

2016-08-25 Thread Masami Hiramatsu
On Wed, 24 Aug 2016 16:47:28 +0530
Binoy Jayan  wrote:

> From: Daniel Wagner 
> 
> The tracepoint can't be used in a code section where we are in the
> middle of a state transition.
> 
> For example, if we place a tracepoint inside start/stop_critical_section(),
> lockdep complains with
> 
> [0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 \
> check_flags.part.36+0x1bc/0x210() [0.036000] \
> DEBUG_LOCKS_WARN_ON(current->softirqs_enabled) [0.036000] Kernel panic - 
> not \
> syncing: panic_on_warn set ... [0.036000]
> [0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
> [0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS \
> 1.7.5-20140709_153950- 04/01/2014 [0.036000]  81f2463a 
> 88007c93bb98 \
> 81afb317 0001 [0.036000]  81f212b3 
> 88007c93bc18 \
> 81af7bc2 88007c93bbb8 [0.036000]  0008 
> 88007c93bc28 \
> 88007c93bbc8 0093bbd8 [0.036000] Call Trace:
> [0.036000]  [] dump_stack+0x4f/0x7b
> [0.036000]  [] panic+0xc0/0x1e9
> [0.036000]  [] ? _raw_spin_unlock_irqrestore+0x38/0x80
> [0.036000]  [] warn_slowpath_common+0xc0/0xc0
> [0.036000]  [] ? __local_bh_enable+0x36/0x70
> [0.036000]  [] warn_slowpath_fmt+0x46/0x50
> [0.036000]  [] check_flags.part.36+0x1bc/0x210
> [0.036000]  [] lock_is_held+0x78/0x90
> [0.036000]  [] ? __local_bh_enable+0x36/0x70
> [0.036000]  [] ? __do_softirq+0x3db/0x500
> [0.036000]  [] trace_preempt_on+0x255/0x260
> [0.036000]  [] preempt_count_sub+0xab/0xf0
> [0.036000]  [] __local_bh_enable+0x36/0x70
> [0.036000]  [] __do_softirq+0x3db/0x500
> [0.036000]  [] run_ksoftirqd+0x1f/0x60
> [0.036000]  [] smpboot_thread_fn+0x193/0x2a0
> [0.036000]  [] ? SyS_setgroups+0x150/0x150
> [0.036000]  [] kthread+0xf2/0x110
> [0.036000]  [] ? wait_for_completion+0xc3/0x120
> [0.036000]  [] ? preempt_count_sub+0xab/0xf0
> [0.036000]  [] ? kthread_create_on_node+0x240/0x240
> [0.036000]  [] ret_from_fork+0x58/0x90
> [0.036000]  [] ? kthread_create_on_node+0x240/0x240
> [0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...
> 
> PeterZ was so kind as to explain to me what is happening:
> 
> "__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
> tells lockdep softirqs are enabled with trace_softirqs_on() after that
> we go and actually modify the preempt_count with preempt_count_sub().
> Then in preempt_count_sub() you call into trace_preempt_on() if this
> was the last preempt_count increment but you do that _before_ you
> actually change the preempt_count with __preempt_count_sub() at this
> point lockdep and preempt_count think the world differs and *boom*"
> 
> So the simplest way to avoid this is by disabling the consistency
> checks.
> 
> We also need to take care of the iteration in trace_events_trigger.c
> to avoid a splat in conjunction with the hist trigger.

Special care for lockdep inside the tracepoint handler is reasonable.

Reviewed-by: Masami Hiramatsu 

Steven, since this seems a bugfix, could you pick this from the series?

Thank you,

> 
> Signed-off-by: Daniel Wagner 
> ---
>  include/linux/rculist.h | 36 
>  include/linux/tracepoint.h  |  4 ++--
>  kernel/trace/trace_events_trigger.c |  6 +++---
>  3 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 8beb98d..bee836b 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct 
> list_head *list,
>   container_of(lockless_dereference(ptr), type, member)
>  
>  /**
> + * list_entry_rcu_notrace - get the struct for this entry (for tracing)
> + * @ptr:the &struct list_head pointer.
> + * @type:   the type of the struct this is embedded in.
> + * @member: the name of the list_head within the struct.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by 
> rcu_read_lock().
> + *
> + * This is the same as list_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_entry_rcu_notrace(ptr, type, member) \
> +({ \
> + typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
> + container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, 
> member); \
> +})
> +
> +/**
>   * Where are list_empty_rcu() and list_first_entry_rcu()?
>   *
>   * Implementing those functions following their counterparts list_empty() and
> @@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct 
> list_head *list,
>pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
>  
>  /**
> + * list_for_each_entry_rcu_notrace   -   iterate

[PATCH v2 1/3] tracing: Dereference pointers without RCU checks

2016-08-24 Thread Binoy Jayan
From: Daniel Wagner 

The tracepoint can't be used in a code section where we are in the
middle of a state transition.

For example, if we place a tracepoint inside start/stop_critical_section(),
lockdep complains with

[0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 \
check_flags.part.36+0x1bc/0x210() [0.036000] \
DEBUG_LOCKS_WARN_ON(current->softirqs_enabled) [0.036000] Kernel panic - 
not \
syncing: panic_on_warn set ... [0.036000]
[0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
[0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS \
1.7.5-20140709_153950- 04/01/2014 [0.036000]  81f2463a 
88007c93bb98 \
81afb317 0001 [0.036000]  81f212b3 
88007c93bc18 \
81af7bc2 88007c93bbb8 [0.036000]  0008 
88007c93bc28 \
88007c93bbc8 0093bbd8 [0.036000] Call Trace:
[0.036000]  [] dump_stack+0x4f/0x7b
[0.036000]  [] panic+0xc0/0x1e9
[0.036000]  [] ? _raw_spin_unlock_irqrestore+0x38/0x80
[0.036000]  [] warn_slowpath_common+0xc0/0xc0
[0.036000]  [] ? __local_bh_enable+0x36/0x70
[0.036000]  [] warn_slowpath_fmt+0x46/0x50
[0.036000]  [] check_flags.part.36+0x1bc/0x210
[0.036000]  [] lock_is_held+0x78/0x90
[0.036000]  [] ? __local_bh_enable+0x36/0x70
[0.036000]  [] ? __do_softirq+0x3db/0x500
[0.036000]  [] trace_preempt_on+0x255/0x260
[0.036000]  [] preempt_count_sub+0xab/0xf0
[0.036000]  [] __local_bh_enable+0x36/0x70
[0.036000]  [] __do_softirq+0x3db/0x500
[0.036000]  [] run_ksoftirqd+0x1f/0x60
[0.036000]  [] smpboot_thread_fn+0x193/0x2a0
[0.036000]  [] ? SyS_setgroups+0x150/0x150
[0.036000]  [] kthread+0xf2/0x110
[0.036000]  [] ? wait_for_completion+0xc3/0x120
[0.036000]  [] ? preempt_count_sub+0xab/0xf0
[0.036000]  [] ? kthread_create_on_node+0x240/0x240
[0.036000]  [] ret_from_fork+0x58/0x90
[0.036000]  [] ? kthread_create_on_node+0x240/0x240
[0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...

PeterZ was so kind as to explain to me what is happening:

"__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET, if so it
tells lockdep softirqs are enabled with trace_softirqs_on() after that
we go and actually modify the preempt_count with preempt_count_sub().
Then in preempt_count_sub() you call into trace_preempt_on() if this
was the last preempt_count increment but you do that _before_ you
actually change the preempt_count with __preempt_count_sub() at this
point lockdep and preempt_count think the world differs and *boom*"

So the simplest way to avoid this is by disabling the consistency
checks.

We also need to take care of the iteration in trace_events_trigger.c
to avoid a splat in conjunction with the hist trigger.

Signed-off-by: Daniel Wagner 
---
 include/linux/rculist.h | 36 
 include/linux/tracepoint.h  |  4 ++--
 kernel/trace/trace_events_trigger.c |  6 +++---
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8beb98d..bee836b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct 
list_head *list,
container_of(lockless_dereference(ptr), type, member)
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry (for tracing)
+ * @ptr:the &struct list_head pointer.
+ * @type:   the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by 
rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+   typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+   container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, 
member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct 
list_head *list,
 pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace -   iterate over rcu list of given 
type (for tracing)
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_head within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() exce