Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 17:16, Nikolay Aleksandrov wrote:
> On 01/06/17 17:00, Nikolay Aleksandrov wrote:
>> On 01/06/17 15:34, Sebastian Ott wrote:
>>> On Thu, 1 Jun 2017, Xin Long wrote:
 On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
  wrote:
> [...]
 I couldn't see any bridge-related thing here, and it couldn't be reproduced
 with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
 is something else in you machine.

 With the latest upstream kernel, can you remove libvirt (virbr0) and boot 
 your
 machine normally, then:
 # brctl addbr br0
 # ip link set br0 up
 # brctl stp br0 on

 to check if it will still hang.
>>>
>>> Nope. That doesn't hang.
>>>
>>>
 If it can't be reproduced in this way, pls add this on your kernel:

 --- a/net/bridge/br_stp_if.c
 +++ b/net/bridge/br_stp_if.c
 @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
 br->stp_enabled = BR_KERNEL_STP;
 br_debug(br, "using kernel STP\n");

 +   WARN_ON(1);
 /* To start timers on any ports left in blocking */
 mod_timer(&br->hello_timer, jiffies + br->hello_time);
 br_port_state_selection(br);
 +   pr_warn("hello timer start done\n");
 }

 spin_unlock_bh(&br->lock);
 diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
 index 60b6fe2..c98b3e5 100644
 --- a/net/bridge/br_stp_timer.c
 +++ b/net/bridge/br_stp_timer.c
 @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
 if (br->dev->flags & IFF_UP) {
 br_config_bpdu_generation(br);

 -   if (br->stp_enabled == BR_KERNEL_STP)
 +   if (br->stp_enabled != BR_USER_STP)
 mod_timer(&br->hello_timer,
   round_jiffies(jiffies + br->hello_time));


 let's see if it hangs when starting the timer. Thanks.
>>>
>>> No hang either:
>>>
>> [snip]
>> Could you please try the patch below ?
>>
>> ---
>>
>> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
>> index 4efd5d54498a..89110319ef0f 100644
>> --- a/net/bridge/br_stp_if.c
>> +++ b/net/bridge/br_stp_if.c
>> @@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
>>  br_debug(br, "using kernel STP\n");
>>  
>>  /* To start timers on any ports left in blocking */
>> -mod_timer(&br->hello_timer, jiffies + br->hello_time);
>> +if (br->dev->flags & IFF_UP)
>> +mod_timer(&br->hello_timer, jiffies + br->hello_time);
>>  br_port_state_selection(br);
>>  }
>>  
>>
> 
> Ah nevermind, this patch reverts it back to the previous state.
> 

Okay, I saw the problem and can reliably reproduce it. I will send a fix for 
testing
in a few minutes. I think the issue is that the timer can be started before the 
bridge
even goes up, i.e. create bridge -> brctl stp br0 on -> ip l del br0
so the del_timer_sync() doesn't get executed and thus it's still armed.

$ while :; do ip l add br0 type bridge hello_time 100; brctl stp br0 on; ip l 
del br0; done;





Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 17:00, Nikolay Aleksandrov wrote:
> On 01/06/17 15:34, Sebastian Ott wrote:
>> On Thu, 1 Jun 2017, Xin Long wrote:
>>> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>>>  wrote:
 [...]
>>> I couldn't see any bridge-related thing here, and it couldn't be reproduced
>>> with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
>>> is something else in you machine.
>>>
>>> With the latest upstream kernel, can you remove libvirt (virbr0) and boot 
>>> your
>>> machine normally, then:
>>> # brctl addbr br0
>>> # ip link set br0 up
>>> # brctl stp br0 on
>>>
>>> to check if it will still hang.
>>
>> Nope. That doesn't hang.
>>
>>
>>> If it can't be reproduced in this way, pls add this on your kernel:
>>>
>>> --- a/net/bridge/br_stp_if.c
>>> +++ b/net/bridge/br_stp_if.c
>>> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
>>> br->stp_enabled = BR_KERNEL_STP;
>>> br_debug(br, "using kernel STP\n");
>>>
>>> +   WARN_ON(1);
>>> /* To start timers on any ports left in blocking */
>>> mod_timer(&br->hello_timer, jiffies + br->hello_time);
>>> br_port_state_selection(br);
>>> +   pr_warn("hello timer start done\n");
>>> }
>>>
>>> spin_unlock_bh(&br->lock);
>>> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
>>> index 60b6fe2..c98b3e5 100644
>>> --- a/net/bridge/br_stp_timer.c
>>> +++ b/net/bridge/br_stp_timer.c
>>> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
>>> if (br->dev->flags & IFF_UP) {
>>> br_config_bpdu_generation(br);
>>>
>>> -   if (br->stp_enabled == BR_KERNEL_STP)
>>> +   if (br->stp_enabled != BR_USER_STP)
>>> mod_timer(&br->hello_timer,
>>>   round_jiffies(jiffies + br->hello_time));
>>>
>>>
>>> let's see if it hangs when starting the timer. Thanks.
>>
>> No hang either:
>>
> [snip]
> Could you please try the patch below ?
> 
> ---
> 
> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
> index 4efd5d54498a..89110319ef0f 100644
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
>   br_debug(br, "using kernel STP\n");
>  
>   /* To start timers on any ports left in blocking */
> - mod_timer(&br->hello_timer, jiffies + br->hello_time);
> + if (br->dev->flags & IFF_UP)
> + mod_timer(&br->hello_timer, jiffies + br->hello_time);
>   br_port_state_selection(br);
>   }
>  
> 

Ah nevermind, this patch reverts it back to the previous state.



Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 15:34, Sebastian Ott wrote:
> On Thu, 1 Jun 2017, Xin Long wrote:
>> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>>  wrote:
>>> [...]
>> I couldn't see any bridge-related thing here, and it couldn't be reproduced
>> with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
>> is something else in you machine.
>>
>> With the latest upstream kernel, can you remove libvirt (virbr0) and boot 
>> your
>> machine normally, then:
>> # brctl addbr br0
>> # ip link set br0 up
>> # brctl stp br0 on
>>
>> to check if it will still hang.
> 
> Nope. That doesn't hang.
> 
> 
>> If it can't be reproduced in this way, pls add this on your kernel:
>>
>> --- a/net/bridge/br_stp_if.c
>> +++ b/net/bridge/br_stp_if.c
>> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
>> br->stp_enabled = BR_KERNEL_STP;
>> br_debug(br, "using kernel STP\n");
>>
>> +   WARN_ON(1);
>> /* To start timers on any ports left in blocking */
>> mod_timer(&br->hello_timer, jiffies + br->hello_time);
>> br_port_state_selection(br);
>> +   pr_warn("hello timer start done\n");
>> }
>>
>> spin_unlock_bh(&br->lock);
>> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
>> index 60b6fe2..c98b3e5 100644
>> --- a/net/bridge/br_stp_timer.c
>> +++ b/net/bridge/br_stp_timer.c
>> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
>> if (br->dev->flags & IFF_UP) {
>> br_config_bpdu_generation(br);
>>
>> -   if (br->stp_enabled == BR_KERNEL_STP)
>> +   if (br->stp_enabled != BR_USER_STP)
>> mod_timer(&br->hello_timer,
>>   round_jiffies(jiffies + br->hello_time));
>>
>>
>> let's see if it hangs when starting the timer. Thanks.
> 
> No hang either:
> 
[snip]
Could you please try the patch below ?

---

diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index 4efd5d54498a..89110319ef0f 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
br_debug(br, "using kernel STP\n");
 
/* To start timers on any ports left in blocking */
-   mod_timer(&br->hello_timer, jiffies + br->hello_time);
+   if (br->dev->flags & IFF_UP)
+   mod_timer(&br->hello_timer, jiffies + br->hello_time);
br_port_state_selection(br);
}
 


Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Sebastian Ott
On Thu, 1 Jun 2017, Xin Long wrote:
> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>  wrote:
> > [...]
> I couldn't see any bridge-related thing here, and it couldn't be reproduced
> with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
> is something else in you machine.
> 
> With the latest upstream kernel, can you remove libvirt (virbr0) and boot your
> machine normally, then:
> # brctl addbr br0
> # ip link set br0 up
> # brctl stp br0 on
> 
> to check if it will still hang.

Nope. That doesn't hang.


> If it can't be reproduced in this way, pls add this on your kernel:
> 
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
> br->stp_enabled = BR_KERNEL_STP;
> br_debug(br, "using kernel STP\n");
> 
> +   WARN_ON(1);
> /* To start timers on any ports left in blocking */
> mod_timer(&br->hello_timer, jiffies + br->hello_time);
> br_port_state_selection(br);
> +   pr_warn("hello timer start done\n");
> }
> 
> spin_unlock_bh(&br->lock);
> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
> index 60b6fe2..c98b3e5 100644
> --- a/net/bridge/br_stp_timer.c
> +++ b/net/bridge/br_stp_timer.c
> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
> if (br->dev->flags & IFF_UP) {
> br_config_bpdu_generation(br);
> 
> -   if (br->stp_enabled == BR_KERNEL_STP)
> +   if (br->stp_enabled != BR_USER_STP)
> mod_timer(&br->hello_timer,
>   round_jiffies(jiffies + br->hello_time));
> 
> 
> let's see if it hangs when starting the timer. Thanks.

No hang either:

[  134.018104] [ cut here ]
[  134.018144] WARNING: CPU: 1 PID: 1339 at net/bridge/br_stp_if.c:181 
br_stp_set_enabled+0x154/0x2b0 [bridge]
[  134.018149] Modules linked in: bridge stp llc rdma_ucm ib_ucm ib_uverbs [...]
[  134.018257] CPU: 1 PID: 1339 Comm: brctl Not tainted 
4.12.0-rc3-00011-gf511c0b-dirty #587
[  134.018262] Hardware name: IBM 2827 H66 705 (LPAR)
[  134.018266] task: d141c100 task.stack: d143
[  134.018271] Krnl PSW : 0704f0018000 03ff802bc4c4 
(br_stp_set_enabled+0x154/0x2b0 [bridge])
[  134.018286]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
RI:0 EA:3
[  134.018294] Krnl GPRS: c5eae501 05dc 0bb8 
0001
[  134.018298]03ff802bc42c d1433c78 0001 
d3ad2d60
[  134.018303]0002 03ff802c21a8 d3ad2d60 
fffe
[  134.018308]d1671738 26a0 03ff802bc42c 
d1433c38
[  134.018320] Krnl Code: 03ff802bc4b4: e54ca9180001mvhi
2328(%r10),1
  03ff802bc4ba: c004brcl
0,3ff802bc4ba
 #03ff802bc4c0: a7f40001brc 
15,3ff802bc4c2
 >03ff802bc4c4: c418b5aalgrl
%r1,3ff802b3018
  03ff802bc4ca: 4120ac10la  
%r2,3088(%r10)
  03ff802bc4ce: e3301004lg  
%r3,0(%r1)
  03ff802bc4d4: e330a8d80008ag  
%r3,2264(%r10)
  03ff802bc4da: c0e5bc8bbrasl   
%r14,3ff802b3df0
[  134.018374] Call Trace:
[  134.018384] ([<03ff802bc42c>] br_stp_set_enabled+0xbc/0x2b0 [bridge])
[  134.018393]  [<03ff802c21d2>] set_stp_state+0x2a/0x40 [bridge] 
[  134.018402]  [<03ff802c0f30>] store_bridge_parm+0xa8/0xf8 [bridge] 
[  134.018410]  [<004012f2>] kernfs_fop_write+0x132/0x208 
[  134.018417]  [<0036088e>] __vfs_write+0x36/0x140 
[  134.018422]  [<00361b54>] vfs_write+0xbc/0x1a0 
[  134.018427]  [<0036323e>] SyS_write+0x66/0xc0 
[  134.018434]  [<008ccc80>] system_call+0xc4/0x28c 
[  134.018438] 5 locks held by brctl/1339:
[  134.018443]  #0:  (sb_writers#5){.+.+.+}, at: [<00361b3e>] 
vfs_write+0xa6/0x1a0
[  134.018462]  #1:  (&of->mutex){+.+.+.}, at: [<00401372>] 
kernfs_fop_write+0x1b2/0x208
[  134.018478]  #2:  (s_active#116){.+.+.+}, at: [<0040137e>] 
kernfs_fop_write+0x1be/0x208
[  134.018496]  #3:  (rtnl_mutex){+.+.+.}, at: [<03ff802c0f08>] 
store_bridge_parm+0x80/0xf8 [bridge]
[  134.018517]  #4:  (&(&br->lock)->rlock){+.}, at: [<03ff802bc42c>] 
br_stp_set_enabled+0xbc/0x2b0 [bridge]
[  134.018537] Last Breaking-Event-Address:
[  134.018546]  [<03ff802bc4c0>] br_stp_set_enabled+0x150/0x2b0 [bridge]
[  134.018551] ---[ end trace 0fc342e82de9b3d7 ]---
[  134.018638] hello timer start done


In the system dump I observed that 3 cpus are within mod_timer (different
timers) and spin for some lock (one o

Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Xin Long
On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
 wrote:
[...]
>
> A system running v4.12-rc3-11-gf511c0b on s390 hangs after boot with no
> messages on the console. The message buffer obtained via a system dump
> looked like this:
>
> [...]
> [   17.870712] virbr0: port 1(virbr0-nic) entered disabled state
> [   19.618523] Unable to handle kernel pointer dereference in virtual kernel 
> address space
> [  250.028426] INFO: task jbd2/dasda1-8:100 blocked for more than 120 seconds.
> [  250.028427]   Not tainted 4.12.0-rc3-00011-gf511c0b #573
> [  250.028428] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  250.028429] jbd2/dasda1-8   D12808   100  2 0x
> [  250.028437] Stack:
> [  250.028437]e8c4f9b0  00233afe 
> e8c48100
> [  250.028441]e8c4f978 001b1c98 e8c4f978 
> e8c4f9d8
> [  250.028444]0400efdcce00 e8c48890  
> efdcce18
> [  250.028447]e8c48100 efdcce00 e8ce8100 
> e73c6900
> [  250.028450]008da090 008c4f54 e8c4f9d8 
> e8c4fa60
> [  250.028453] Call Trace:
> [  250.028458] ([<008c4f54>] __schedule+0xb14/0xc90)
> [  250.028459]  [<008c5164>] schedule+0x94/0xc0
> [  250.028462]  [<001802ac>] io_schedule+0x34/0x58
> [  250.028464]  [<002a44c2>] wait_on_page_bit+0x16a/0x198
> [  250.028465]  [<002a4576>] __filemap_fdatawait_range+0x86/0x188
> [  250.028467]  [<002a46a6>] filemap_fdatawait_range+0x2e/0x58
> [  250.028471]  [<004719d4>] 
> jbd2_journal_commit_transaction+0x10e4/0x2200
> [  250.028473]  [<0047890a>] kjournald2+0xda/0x2c0
> [  250.028475]  [<0016da5e>] kthread+0x166/0x178
> [  250.028477]  [<008cce7a>] kernel_thread_starter+0x6/0xc
> [  250.028479]  [<008cce74>] kernel_thread_starter+0x0/0xc
> [  250.028480] INFO: lockdep is turned off.
> [...]
I couldn't see any bridge-related thing here, and it couldn't be reproduced
with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
is something else in you machine.

With the latest upstream kernel, can you remove libvirt (virbr0) and boot your
machine normally, then:
# brctl addbr br0
# ip link set br0 up
# brctl stp br0 on

to check if it will still hang.

If it can't be reproduced in this way, pls add this on your kernel:

--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
br->stp_enabled = BR_KERNEL_STP;
br_debug(br, "using kernel STP\n");

+   WARN_ON(1);
/* To start timers on any ports left in blocking */
mod_timer(&br->hello_timer, jiffies + br->hello_time);
br_port_state_selection(br);
+   pr_warn("hello timer start done\n");
}

spin_unlock_bh(&br->lock);
diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
index 60b6fe2..c98b3e5 100644
--- a/net/bridge/br_stp_timer.c
+++ b/net/bridge/br_stp_timer.c
@@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
if (br->dev->flags & IFF_UP) {
br_config_bpdu_generation(br);

-   if (br->stp_enabled == BR_KERNEL_STP)
+   if (br->stp_enabled != BR_USER_STP)
mod_timer(&br->hello_timer,
  round_jiffies(jiffies + br->hello_time));


let's see if it hangs when starting the timer. Thanks.