Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-24 Thread Sander Eikelenboom
On 17/08/2019 18:35, Eric Dumazet wrote:
> 
> 
> On 8/17/19 10:24 AM, Sander Eikelenboom wrote:
>> On 12/08/2019 19:56, Eric Dumazet wrote:
>>>
>>>
>>> On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
>>>> L.S.,
>>>>
>>>> While testing a somewhere-after-5.3-rc3 kernel (which included the latest 
>>>> net merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
>>>> one of my Xen VM's (which gets quite some network load) crashed.
>>>> See below for the stacktrace.
>>>>
>>>> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to 
>>>> be an option at the moment. 
>>>> I haven't encountered this on 5.2, so it seems to be a regression against 
>>>> 5.2.
>>>>
>>>> Any ideas ?
>>>>
>>>> --
>>>> Sander
>>>>
>>>>
>>>> [16930.653595] general protection fault:  [#1] SMP NOPTI
>>>> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
>>>> 5.3.0-rc3-20190809-doflr+ #1
>>>> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
>>>> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>>>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 
>>>> <8b> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>>>> [16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
>>>> [16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
>>>> 801b
>>>
>>> crash in "mov 0x20(%rax),%eax" and RAX=fffe888005bf62c0 (not a valid 
>>> kernel address)
>>>
>>> Looks like one bit corruption maybe.
>>>
>>> Nothing comes to mind really between 5.2 and 5.3 that could explain this.
>>>
>>>> [16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 
>>>> 888016b00880
>>>> [16930.653819] RBP: 888016b00880 R08: 0001 R09: 
>>>> 
>>>> [16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 
>>>> 05a0
>>>> [16930.653875] R13: 0001 R14: bfe62d46 R15: 
>>>> 0004
>>>> [16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
>>>> knlGS:
>>>> [16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
>>>> [16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
>>>> 06f0
>>>> [16930.653993] Call Trace:
>>>> [16930.654005]  
>>>> [16930.654018]  tcp_ack+0xbb0/0x1230
>>>> [16930.654033]  tcp_rcv_established+0x2e8/0x630
>>>> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
>>>> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
>>>> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
>>>> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
>>>> [16930.654128]  ip_local_deliver+0x4d/0xe0
>>>> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
>>>> [16930.654163]  ip_rcv+0x4c/0xd0
>>>> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
>>>> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
>>>> [16930.654219]  napi_gro_receive+0xe7/0x140
>>>> [16930.654237]  xennet_poll+0x9be/0xae0
>>>> [16930.654254]  net_rx_action+0x136/0x340
>>>> [16930.654271]  __do_softirq+0xdd/0x2cf
>>>> [16930.654287]  irq_exit+0x7a/0xa0
>>>> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
>>>> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
>>>> [16930.654339]  
>>>> [16930.654349] RIP: 0033:0x55de0d87db99
>>>> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 
>>>> f4 eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a 
>>>> <75> 25 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
>>>> [16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
>>>> ff0c
>>>> [16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 
>>>> 007f
>>>> [16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 
>>>> 0002
>>>> [16930.655062] RBP: 7fff R08: 80ea R09: 
>>>> 01f0
>>>> [16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 
>>>> 55de0f3e0f2a
>>>> [16930.655116] R13

Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-17 Thread Sander Eikelenboom
On 12/08/2019 19:56, Eric Dumazet wrote:
> 
> 
> On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
>> L.S.,
>>
>> While testing a somewhere-after-5.3-rc3 kernel (which included the latest 
>> net merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
>> one of my Xen VM's (which gets quite some network load) crashed.
>> See below for the stacktrace.
>>
>> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to be 
>> an option at the moment. 
>> I haven't encountered this on 5.2, so it seems to be a regression against 
>> 5.2.
>>
>> Any ideas ?
>>
>> --
>> Sander
>>
>>
>> [16930.653595] general protection fault:  [#1] SMP NOPTI
>> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
>> 5.3.0-rc3-20190809-doflr+ #1
>> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
>> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 
>> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>> [16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
>> [16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
>> 801b
> 
> crash in "mov 0x20(%rax),%eax" and RAX=fffe888005bf62c0 (not a valid 
> kernel address)
> 
> Looks like one bit corruption maybe.
> 
> Nothing comes to mind really between 5.2 and 5.3 that could explain this.
> 
>> [16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 
>> 888016b00880
>> [16930.653819] RBP: 888016b00880 R08: 0001 R09: 
>> 
>> [16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 
>> 05a0
>> [16930.653875] R13: 0001 R14: bfe62d46 R15: 
>> 0004
>> [16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
>> knlGS:
>> [16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
>> [16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
>> 06f0
>> [16930.653993] Call Trace:
>> [16930.654005]  
>> [16930.654018]  tcp_ack+0xbb0/0x1230
>> [16930.654033]  tcp_rcv_established+0x2e8/0x630
>> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
>> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
>> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
>> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
>> [16930.654128]  ip_local_deliver+0x4d/0xe0
>> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
>> [16930.654163]  ip_rcv+0x4c/0xd0
>> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
>> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
>> [16930.654219]  napi_gro_receive+0xe7/0x140
>> [16930.654237]  xennet_poll+0x9be/0xae0
>> [16930.654254]  net_rx_action+0x136/0x340
>> [16930.654271]  __do_softirq+0xdd/0x2cf
>> [16930.654287]  irq_exit+0x7a/0xa0
>> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
>> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
>> [16930.654339]  
>> [16930.654349] RIP: 0033:0x55de0d87db99
>> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 
>> f4 eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a <75> 
>> 25 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
>> [16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
>> ff0c
>> [16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 
>> 007f
>> [16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 
>> 0002
>> [16930.655062] RBP: 7fff R08: 80ea R09: 
>> 01f0
>> [16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 
>> 55de0f3e0f2a
>> [16930.655116] R13: 0010 R14: 7f16 R15: 
>> 0080
>> [16930.655144] Modules linked in:
>> [16930.655200] ---[ end trace 533367c95501b645 ]---
>> [16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
>> [16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 
>> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>> [16930.655312] RSP: :c9003ad8 EFLAGS: 00010286
>> [16930.655331] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
>> 801b
>> [16930.655360] RDX: 05a0 RSI: 8880115fb800 RDI: 
>> 888016b00880
>> [16930.655387] RBP: 888016b00880 R08: 0001 R

Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-12 Thread Sander Eikelenboom
On 12/08/2019 19:56, Eric Dumazet wrote:
> 
> 
> On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
>> L.S.,
>>
>> While testing a somewhere-after-5.3-rc3 kernel (which included the latest 
>> net merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
>> one of my Xen VM's (which gets quite some network load) crashed.
>> See below for the stacktrace.
>>
>> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to be 
>> an option at the moment. 
>> I haven't encountered this on 5.2, so it seems to be a regression against 
>> 5.2.
>>
>> Any ideas ?
>>
>> --
>> Sander
>>
>>
>> [16930.653595] general protection fault:  [#1] SMP NOPTI
>> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
>> 5.3.0-rc3-20190809-doflr+ #1
>> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
>> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 
>> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>> [16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
>> [16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
>> 801b
> 
> crash in "mov 0x20(%rax),%eax" and RAX=fffe888005bf62c0 (not a valid 
> kernel address)
> 
> Looks like one bit corruption maybe.
> 
> Nothing comes to mind really between 5.2 and 5.3 that could explain this.
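
A minimal sketch of the "one bit corruption" reading above. It assumes 48-bit virtual addresses (4-level paging) and assumes the intended pointer was ffff888005bf62c0, that is, the same value with the top 16 bits all set; neither assumption is confirmed by the oops itself. An x86-64 access through a non-canonical address raises a general protection fault rather than a page fault, which would match the first line of the splat.

```python
# Value of RAX copied from the oops.
FAULTING_RAX = 0xFFFE888005BF62C0
# ASSUMPTION: the pointer the kernel meant to dereference (top 16 bits all set).
ASSUMED_RAX = 0xFFFF888005BF62C0


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) are all equal, as x86-64 requires."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def flipped_bits(a, b):
    """Bit positions at which the two 64-bit values differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


if __name__ == "__main__":
    print(is_canonical(FAULTING_RAX))               # False
    print(is_canonical(ASSUMED_RAX))                # True
    print(flipped_bits(FAULTING_RAX, ASSUMED_RAX))  # [48]
```

Under those assumptions the two values differ only in bit 48, which is what a single flipped bit would look like.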

Hi Eric,

Hmm, it could be a rare coincidence, so that it just never occurred on pre-5.3 
by chance.
Let's wait and see if it reoccurs, will report back if it does.

Thanks for your explanation.

--
Sander


>> [16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 
>> 888016b00880
>> [16930.653819] RBP: 888016b00880 R08: 0001 R09: 
>> 
>> [16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 
>> 05a0
>> [16930.653875] R13: 0001 R14: bfe62d46 R15: 
>> 0004
>> [16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
>> knlGS:
>> [16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
>> [16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
>> 06f0
>> [16930.653993] Call Trace:
>> [16930.654005]  
>> [16930.654018]  tcp_ack+0xbb0/0x1230
>> [16930.654033]  tcp_rcv_established+0x2e8/0x630
>> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
>> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
>> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
>> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
>> [16930.654128]  ip_local_deliver+0x4d/0xe0
>> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
>> [16930.654163]  ip_rcv+0x4c/0xd0
>> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
>> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
>> [16930.654219]  napi_gro_receive+0xe7/0x140
>> [16930.654237]  xennet_poll+0x9be/0xae0
>> [16930.654254]  net_rx_action+0x136/0x340
>> [16930.654271]  __do_softirq+0xdd/0x2cf
>> [16930.654287]  irq_exit+0x7a/0xa0
>> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
>> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
>> [16930.654339]  
>> [16930.654349] RIP: 0033:0x55de0d87db99
>> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 
>> f4 eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a <75> 
>> 25 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
>> [16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
>> ff0c
>> [16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 
>> 007f
>> [16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 
>> 0002
>> [16930.655062] RBP: 7fff R08: 80ea R09: 
>> 01f0
>> [16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 
>> 55de0f3e0f2a
>> [16930.655116] R13: 0010 R14: 7f16 R15: 
>> 0080
>> [16930.655144] Modules linked in:
>> [16930.655200] ---[ end trace 533367c95501b645 ]---
>> [16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
>> [16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 
>> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>> [16930.655312] RSP: :c9003ad8 EFLAGS: 00010286
>> [16930.655331] RAX: fffe888005bf62c0 RBX: 88801

5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-12 Thread Sander Eikelenboom
L.S.,

While testing a somewhere-after-5.3-rc3 kernel (which included the latest net 
merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
one of my Xen VM's (which gets quite some network load) crashed.
See below for the stacktrace.

Unfortunately I haven't got a clear trigger, so bisection doesn't seem to be an 
option at the moment. 
I haven't encountered this on 5.2, so it seems to be a regression against 5.2.

Any ideas ?

--
Sander


[16930.653595] general protection fault:  [#1] SMP NOPTI
[16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
5.3.0-rc3-20190809-doflr+ #1
[16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
[16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 
53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 20 
66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
[16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
[16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 801b
[16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 888016b00880
[16930.653819] RBP: 888016b00880 R08: 0001 R09: 
[16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 05a0
[16930.653875] R13: 0001 R14: bfe62d46 R15: 0004
[16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
knlGS:
[16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
[16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 06f0
[16930.653993] Call Trace:
[16930.654005]  
[16930.654018]  tcp_ack+0xbb0/0x1230
[16930.654033]  tcp_rcv_established+0x2e8/0x630
[16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
[16930.654070]  tcp_v4_rcv+0xac9/0xcb0
[16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
[16930.654109]  ip_local_deliver_finish+0x3f/0x50
[16930.654128]  ip_local_deliver+0x4d/0xe0
[16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
[16930.654163]  ip_rcv+0x4c/0xd0
[16930.654179]  __netif_receive_skb_one_core+0x79/0x90
[16930.654200]  netif_receive_skb_internal+0x2a/0xa0
[16930.654219]  napi_gro_receive+0xe7/0x140
[16930.654237]  xennet_poll+0x9be/0xae0
[16930.654254]  net_rx_action+0x136/0x340
[16930.654271]  __do_softirq+0xdd/0x2cf
[16930.654287]  irq_exit+0x7a/0xa0
[16930.654304]  xen_evtchn_do_upcall+0x27/0x40
[16930.654320]  xen_hvm_callback_vector+0xf/0x20
[16930.654339]  
[16930.654349] RIP: 0033:0x55de0d87db99
[16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 f4 
eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a <75> 25 44 
38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
[16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
ff0c
[16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 007f
[16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 0002
[16930.655062] RBP: 7fff R08: 80ea R09: 01f0
[16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 55de0f3e0f2a
[16930.655116] R13: 0010 R14: 7f16 R15: 0080
[16930.655144] Modules linked in:
[16930.655200] ---[ end trace 533367c95501b645 ]---
[16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
[16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 
53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 20 
66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
[16930.655312] RSP: :c9003ad8 EFLAGS: 00010286
[16930.655331] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 801b
[16930.655360] RDX: 05a0 RSI: 8880115fb800 RDI: 888016b00880
[16930.655387] RBP: 888016b00880 R08: 0001 R09: 
[16930.655414] R10: 88800ae00800 R11: bfe632e6 R12: 05a0
[16930.655441] R13: 0001 R14: bfe62d46 R15: 0004
[16930.655475] FS:  7fe71fe2cb80() GS:88801f20() 
knlGS:
[16930.655502] CS:  0010 DS:  ES:  CR0: 80050033
[16930.655525] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 06f0
[16930.63] Kernel panic - not syncing: Fatal exception in interrupt
[16930.655789] Kernel Offset: disabled
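
For readers who want to find the faulting instruction themselves: in the "Code:" line the byte wrapped in <> is the one RIP pointed at. The sketch below only splits the dump at that marker and picks apart the ModRM byte of the instruction that follows. The helper names are made up for illustration, and the kernel tree's scripts/decodecode is the proper tool for a full disassembly.

```python
# "Code:" bytes copied from the oops in this message.
CODE_LINE = (
    "2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd "
    "53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 "
    "<8b> 40 20 "
    "66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8"
)


def split_at_fault(code):
    """Return (bytes before the <..> marker, bytes from the marker onward)."""
    tokens = code.split()
    idx = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    values = [int(tok.strip("<>"), 16) for tok in tokens]
    return bytes(values[:idx]), bytes(values[idx:])


def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return byte >> 6, (byte >> 3) & 7, byte & 7


if __name__ == "__main__":
    before, after = split_at_fault(CODE_LINE)
    print(after[:3].hex(" "))       # 8b 40 20
    print(decode_modrm(after[1]))   # (1, 0, 0)
    print(hex(after[2]))            # 0x20
```

Opcode 8b with mod=1, reg=0 (eax), rm=0 (rax) and an 8-bit displacement of 0x20 is a 32-bit load from 0x20(%rax) into %eax, so the load goes through RAX, which holds fffe888005bf62c0 in the register dump.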


Re: RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0

2019-08-09 Thread Sander Eikelenboom
On 08/08/2019 12:21, Paolo Valente wrote:
> 
> 
>> On 8 Aug 2019, at 12:21, Sander Eikelenboom wrote:
>>
>> On 08/08/2019 11:10, Paolo Valente wrote:
>>>
>>>
>>>> On 8 Aug 2019, at 11:05, Sander Eikelenboom wrote:
>>>>
>>>> L.S.,
>>>>
>>>> While testing a Linux 5.3-rc3 kernel on my Xen server I came across the 
>>>> splat below when trying to shut down all the VMs.
>>>> This is after the server had run for a few days without any problem. It 
>>>> seems to happen consistently.
>>>>
>>>> It seems to be in the same area as 
>>>> dbc3117d4ca9e17819ac73501e914b8422686750, but rc3 already incorporates 
>>>> that patch.
>>>>
>>>> Any ideas ?
>>>>
>>>
>>> Could you try these fixes I proposed yesterday:
>>> https://lkml.org/lkml/2019/8/7/536
>>> or, on patchwork:
>>> https://patchwork.kernel.org/patch/11082247/
>>> https://patchwork.kernel.org/patch/11082249/
>>
>> Hi Paolo,
>>
>> These two above seem to fix the issue !
>> So thanks for the swift reply (and the patchwork links for easy
>> downloading the patches).
>>
>> I will test the third, unrelated patch as well, but if you don't hear
>> back, it's all good.
>>
> 
> Great! Thank you for offering to test the other patch as well. Tested-by 
> tags are welcome too :)

Hi,

I haven't seen any problems with the patch so far, but I haven't tested it
under constrained memory, so I don't think a Tested-by is justified in this
case.

--
Sander

> Thanks,
> Paolo
> 
>> Thanks again !
>>
>> --
>> Sander
>>
>>> I posted a further fix too, which should be unrelated. But, just in case:
>>> https://lkml.org/lkml/2019/8/7/715
>>> or, on patchwork:
>>> https://patchwork.kernel.org/patch/11082521/
>>>
>>> Crossing my fingers (and thank you for reporting this),
>>> Paolo
>>>
>>>> --
>>>> Sander
>>>>
>>>>
>>>> [80915.716048] BUG: unable to handle page fault for address: 
>>>> 1008
>>>> [80915.724188] #PF: supervisor write access in kernel mode
>>>> [80915.733182] #PF: error_code(0x0002) - not-present page
>>>> [80915.741455] PGD 0 P4D 0 
>>>> [80915.750538] Oops: 0002 [#1] SMP NOPTI
>>>> [80915.758425] CPU: 4 PID: 11407 Comm: 17.hda-2 Tainted: GW
>>>>  5.3.0-rc3-20190807-doflr+ #1
>>>> [80915.766137] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>>> V1.8B1 09/13/2010
>>>> [80915.773737] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
>>>> [80915.781294] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 
>>>> f0 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 
>>>> <48> 89 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
>>>> [80915.796792] RSP: e02b:c9000473be28 EFLAGS: 00010006
>>>> [80915.804419] RAX: 888070393200 RBX: 888076c4a800 RCX: 
>>>> 888076c4a9f8
>>>> [80915.810254] device vif17.0 left promiscuous mode
>>>> [80915.811906] RDX: 1000 RSI: 1000 RDI: 
>>>> 
>>>> [80915.811908] RBP: 888077efc398 R08: 0004 R09: 
>>>> 81106800
>>>> [80915.811909] R10: 88807804ca40 R11: c9000473be31 R12: 
>>>> 888005256bf0
>>>> [80915.811909] R13:  R14: 888005256800 R15: 
>>>> 82a6a3c0
>>>> [80915.811919] FS:  7f1c30a8dbc0() GS:88807d50() 
>>>> knlGS:
>>>> [80915.819456] xen_bridge: port 18(vif17.0) entered disabled state
>>>> [80915.826569] CS:  1e030 DS:  ES:  CR0: 80050033
>>>> [80915.826571] CR2: 1008 CR3: 5d9d CR4: 
>>>> 0660
>>>> [80915.826575] Call Trace:
>>>> [80915.826592]  bfq_exit_icq+0xe/0x20
>>>> [80915.826595]  put_io_context_active+0x52/0x80
>>>> [80915.826599]  do_exit+0x774/0xac0
>>>> [80915.906037]  ? xen_blkif_be_int+0x30/0x30
>>>> [80915.913311]  kthread+0xda/0x130
>>>> [80915.920398]  ? kthread_park+0x80/0x80
>>>> [80915.927524]  ret_from_fork+0x22/0x40
>>>> [80915.934512] Modules linked in:
>>>> [80915.941412] CR2: 1008

Re: RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0

2019-08-08 Thread Sander Eikelenboom
On 08/08/2019 11:10, Paolo Valente wrote:
> 
> 
>> On 8 Aug 2019, at 11:05, Sander Eikelenboom wrote:
>>
>> L.S.,
>>
>> While testing a Linux 5.3-rc3 kernel on my Xen server I came across the 
>> splat below when trying to shut down all the VMs.
>> This is after the server had run for a few days without any problem. It 
>> seems to happen consistently.
>>
>> It seems to be in the same area as dbc3117d4ca9e17819ac73501e914b8422686750, 
>> but rc3 already incorporates that patch.
>>
>> Any ideas ?
>>
> 
> Could you try these fixes I proposed yesterday:
> https://lkml.org/lkml/2019/8/7/536
> or, on patchwork:
> https://patchwork.kernel.org/patch/11082247/
> https://patchwork.kernel.org/patch/11082249/

Hi Paolo,

These two above seem to fix the issue !
So thanks for the swift reply (and the patchwork links for easy
downloading the patches).

I will test the third, unrelated patch as well, but if you don't hear
back, it's all good.

Thanks again !

--
Sander

> I posted a further fix too, which should be unrelated. But, just in case:
> https://lkml.org/lkml/2019/8/7/715
> or, on patchwork:
> https://patchwork.kernel.org/patch/11082521/
> 
> Crossing my fingers (and thank you for reporting this),
> Paolo
> 
>> --
>> Sander
>>
>>
>> [80915.716048] BUG: unable to handle page fault for address: 1008
>> [80915.724188] #PF: supervisor write access in kernel mode
>> [80915.733182] #PF: error_code(0x0002) - not-present page
>> [80915.741455] PGD 0 P4D 0 
>> [80915.750538] Oops: 0002 [#1] SMP NOPTI
>> [80915.758425] CPU: 4 PID: 11407 Comm: 17.hda-2 Tainted: GW 
>> 5.3.0-rc3-20190807-doflr+ #1
>> [80915.766137] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>> V1.8B1 09/13/2010
>> [80915.773737] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
>> [80915.781294] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 
>> f0 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 
>> 89 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
>> [80915.796792] RSP: e02b:c9000473be28 EFLAGS: 00010006
>> [80915.804419] RAX: 888070393200 RBX: 888076c4a800 RCX: 
>> 888076c4a9f8
>> [80915.810254] device vif17.0 left promiscuous mode
>> [80915.811906] RDX: 1000 RSI: 1000 RDI: 
>> 
>> [80915.811908] RBP: 888077efc398 R08: 0004 R09: 
>> 81106800
>> [80915.811909] R10: 88807804ca40 R11: c9000473be31 R12: 
>> 888005256bf0
>> [80915.811909] R13:  R14: 888005256800 R15: 
>> 82a6a3c0
>> [80915.811919] FS:  7f1c30a8dbc0() GS:88807d50() 
>> knlGS:
>> [80915.819456] xen_bridge: port 18(vif17.0) entered disabled state
>> [80915.826569] CS:  1e030 DS:  ES:  CR0: 80050033
>> [80915.826571] CR2: 1008 CR3: 5d9d CR4: 
>> 0660
>> [80915.826575] Call Trace:
>> [80915.826592]  bfq_exit_icq+0xe/0x20
>> [80915.826595]  put_io_context_active+0x52/0x80
>> [80915.826599]  do_exit+0x774/0xac0
>> [80915.906037]  ? xen_blkif_be_int+0x30/0x30
>> [80915.913311]  kthread+0xda/0x130
>> [80915.920398]  ? kthread_park+0x80/0x80
>> [80915.927524]  ret_from_fork+0x22/0x40
>> [80915.934512] Modules linked in:
>> [80915.941412] CR2: 1008
>> [80915.948221] ---[ end trace 61315493e0f8ef40 ]---
>> [80915.954984] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
>> [80915.961850] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 
>> f0 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 
>> 89 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
>> [80915.976124] RSP: e02b:c9000473be28 EFLAGS: 00010006
>> [80915.983205] RAX: 888070393200 RBX: 888076c4a800 RCX: 
>> 888076c4a9f8
>> [80915.990321] RDX: 1000 RSI: 1000 RDI: 
>> 
>> [80915.997319] RBP: 888077efc398 R08: 0004 R09: 
>> 81106800
>> [80916.004427] R10: 88807804ca40 R11: c9000473be31 R12: 
>> 888005256bf0
>> [80916.011525] R13:  R14: 888005256800 R15: 
>> 82a6a3c0
>> [80916.018679] FS:  7f1c30a8dbc0() GS:88807d50() 
>> knlGS:
>> [80916.025897] CS:  1e030 DS:  ES:  CR0: 80050033
>> [80916.033116] CR2: 1008 CR3: 5d9d CR4: 
>> 0660
>> [80916.040348] Fixing recursive fault but reboot is needed!
> 



RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0

2019-08-08 Thread Sander Eikelenboom
L.S.,

While testing a Linux 5.3-rc3 kernel on my Xen server I came across the splat 
below when trying to shut down all the VMs.
This is after the server had run for a few days without any problem. It seems 
to happen consistently.

It seems to be in the same area as dbc3117d4ca9e17819ac73501e914b8422686750, but 
rc3 already incorporates that patch.

Any ideas ?

--
Sander


[80915.716048] BUG: unable to handle page fault for address: 1008
[80915.724188] #PF: supervisor write access in kernel mode
[80915.733182] #PF: error_code(0x0002) - not-present page
[80915.741455] PGD 0 P4D 0 
[80915.750538] Oops: 0002 [#1] SMP NOPTI
[80915.758425] CPU: 4 PID: 11407 Comm: 17.hda-2 Tainted: GW 
5.3.0-rc3-20190807-doflr+ #1
[80915.766137] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[80915.773737] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
[80915.781294] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 f0 
01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 89 4e 
08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
[80915.796792] RSP: e02b:c9000473be28 EFLAGS: 00010006
[80915.804419] RAX: 888070393200 RBX: 888076c4a800 RCX: 888076c4a9f8
[80915.810254] device vif17.0 left promiscuous mode
[80915.811906] RDX: 1000 RSI: 1000 RDI: 
[80915.811908] RBP: 888077efc398 R08: 0004 R09: 81106800
[80915.811909] R10: 88807804ca40 R11: c9000473be31 R12: 888005256bf0
[80915.811909] R13:  R14: 888005256800 R15: 82a6a3c0
[80915.811919] FS:  7f1c30a8dbc0() GS:88807d50() 
knlGS:
[80915.819456] xen_bridge: port 18(vif17.0) entered disabled state
[80915.826569] CS:  1e030 DS:  ES:  CR0: 80050033
[80915.826571] CR2: 1008 CR3: 5d9d CR4: 0660
[80915.826575] Call Trace:
[80915.826592]  bfq_exit_icq+0xe/0x20
[80915.826595]  put_io_context_active+0x52/0x80
[80915.826599]  do_exit+0x774/0xac0
[80915.906037]  ? xen_blkif_be_int+0x30/0x30
[80915.913311]  kthread+0xda/0x130
[80915.920398]  ? kthread_park+0x80/0x80
[80915.927524]  ret_from_fork+0x22/0x40
[80915.934512] Modules linked in:
[80915.941412] CR2: 1008
[80915.948221] ---[ end trace 61315493e0f8ef40 ]---
[80915.954984] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
[80915.961850] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 f0 
01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 89 4e 
08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
[80915.976124] RSP: e02b:c9000473be28 EFLAGS: 00010006
[80915.983205] RAX: 888070393200 RBX: 888076c4a800 RCX: 888076c4a9f8
[80915.990321] RDX: 1000 RSI: 1000 RDI: 
[80915.997319] RBP: 888077efc398 R08: 0004 R09: 81106800
[80916.004427] R10: 88807804ca40 R11: c9000473be31 R12: 888005256bf0
[80916.011525] R13:  R14: 888005256800 R15: 82a6a3c0
[80916.018679] FS:  7f1c30a8dbc0() GS:88807d50() 
knlGS:
[80916.025897] CS:  1e030 DS:  ES:  CR0: 80050033
[80916.033116] CR2: 1008 CR3: 5d9d CR4: 0660
[80916.040348] Fixing recursive fault but reboot is needed!
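
A short cross-check of the numbers in the splat, as a sketch only. The faulting bytes "<48> 89 4e 08" decode to a 64-bit store of %rcx to 0x8(%rsi), and the page-fault error code uses bit 0 for "page present", bit 1 for "write" and bit 2 for "user mode". The helper names below are invented for illustration. Reading the store as the tail of a list unlink that follows a bogus next pointer is an inference, not something the oops states.

```python
# Values copied from the oops (the archive shows them without leading zeros).
RSI = 0x1000
CR2 = 0x1008
ERROR_CODE = 0x0002
FAULTING_BYTES = bytes([0x48, 0x89, 0x4E, 0x08])  # "<48> 89 4e 08"

REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_pf_error(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x1),
        "write": bool(code & 0x2),
        "user": bool(code & 0x4),
    }


def store_target(insn, regs):
    """Effective address of a 'REX.W 89 /r' store using base register + disp8."""
    rex, opcode, modrm, disp8 = insn
    if rex != 0x48 or opcode != 0x89:
        raise ValueError("only handles REX.W + 89 /r")
    mod, rm = modrm >> 6, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only handles base register + disp8 without SIB")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # disp8 is signed
    return regs[REG_NAMES[rm]] + disp


if __name__ == "__main__":
    print(decode_pf_error(ERROR_CODE))
    print(hex(store_target(FAULTING_BYTES, {"rsi": RSI})))  # 0x1008
```

The computed store address equals CR2, and the decoded error code agrees with the "supervisor write access" and "not-present page" lines, so the write went through RSI, which held 0x1000 rather than a kernel pointer.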


Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-10 Thread Sander Eikelenboom
On 10/02/2019 12:44, Heiner Kallweit wrote:
> On 10.02.2019 10:16, Sander Eikelenboom wrote:
>> On 09/02/2019 12:50, Heiner Kallweit wrote:
>>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While testing a Linux 5.0-rc5 kernel (with some patches on top 
>>>>>>>>>>>>>> but they don't seem related) under Xen I hit the nasty splat below, 
>>>>>>>>>>>>>> which I haven't encountered with Linux 4.20.x.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and 
>>>>>>>>>>>>>> bisecting could be nasty due to another (networking related) 
>>>>>>>>>>>>>> kernel bug.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please 
>>>>>>>>>>>>>> feel free to ask.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 
>>>>>>>>>>>>> driver between
>>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root 
>>>>>>>>>>>>> cause could
>>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm, I did some digging and I think these are relevant:
>>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded 
>>>>>>>>>>>> mmiowb barriers
>>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of 
>>>>>>>>>>>> xmit_more and __netdev_sent_queue
>>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add 
>>>>>>>>>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>>
>>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I 
>>>>>>>>>>> haven't heard about
>>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a 
>>>>>>>>>>> lot of mainboards
>>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>>> Does the issue occur under specific circumstances like very high 
>>>>>>>>>>> load?
>>>>>>>>>>
>>>>>>>>>> Yep, the box is already quite contended with the Xen VMs and if I 
>>>>>>>>>> remember correctly it occurred while compiling a kernel
>>>>>>>>>> on the host.
>>>>>>>>>>
>>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to 
>>>>>>>>>>> involve Eric Dumazet
>>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>>
>>>>>>>>>> It could also be that the barriers weren't as unneeded as assumed.
>>>>>>>>>
>>>>>>
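
For context on the BUG in the subject line: byte queue limits (BQL) keep a running count of bytes handed to the NIC and bytes the NIC has reported as completed. As far as I can tell, the check that fires in lib/dynamic_queue_limits.c is that a completion may never exceed what is still in flight. The toy model below only illustrates that invariant. It is not the kernel's code, it ignores limit recalculation and counter wraparound, and the class and method names are invented.

```python
class ToyDql:
    """Toy model of BQL accounting (illustrative only, not the kernel code)."""

    def __init__(self):
        self.num_queued = 0      # bytes reported as sent by the xmit path
        self.num_completed = 0   # bytes reported as done by TX completion

    def queued(self, nbytes):
        self.num_queued += nbytes

    def completed(self, nbytes):
        in_flight = self.num_queued - self.num_completed
        # The invariant: cannot complete more than what is in the queue.
        if nbytes > in_flight:
            raise AssertionError(
                "completed %d bytes but only %d in flight" % (nbytes, in_flight)
            )
        self.num_completed += nbytes


if __name__ == "__main__":
    dql = ToyDql()
    dql.queued(1500)
    dql.completed(1500)      # fine: matches what was queued
    try:
        dql.completed(60)    # completion seen before its bytes were accounted
    except AssertionError as exc:
        print("invariant violated:", exc)
```

In this model the invariant breaks whenever a completion is processed before the matching bytes were counted as queued, which is one way an ordering problem between the transmit path and TX completion could show up.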

Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-10 Thread Sander Eikelenboom
On 09/02/2019 12:50, Heiner Kallweit wrote:
> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>
>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but 
>>>>>>>>>>>> they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and 
>>>>>>>>>>>> bisecting could be nasty due to another (networking related) 
>>>>>>>>>>>> kernel bug.
>>>>>>>>>>>>
>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please 
>>>>>>>>>>>> feel free to ask.
>>>>>>>>>>>>
>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver 
>>>>>>>>>>> between
>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root 
>>>>>>>>>>> cause could
>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>
>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded 
>>>>>>>>>> mmiowb barriers
>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of 
>>>>>>>>>> xmit_more and __netdev_sent_queue
>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add 
>>>>>>>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>
>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I 
>>>>>>>>> haven't heard about
>>>>>>>>> this issue from any user of physical hw. And due to the fact that a 
>>>>>>>>> lot of mainboards
>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>
>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I 
>>>>>>>> remember correctly it occurred while kernel compiling
>>>>>>>> on the host.
>>>>>>>>
>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to 
>>>>>>>>> involve Eric Dumazet
>>>>>>>>> as author of the underlying changes.
>>>>>>>>
>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>
>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it 
>>>>>>> would be good to
>>>>>>> test also with only 
>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>>>>>> barriers
>>>>>>> removed.
>>>>>>>
>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>
>>>>>>> Sure, thanks.
>>>>>>>
>>>>>>>> B

Re: Linux 5.0 regression: BUG: unable to handle kernel paging request at ffff888023e26778 RIP: e030:move_page_tables+0x7c1/0xae0

2019-02-09 Thread Sander Eikelenboom
On 09/02/2019 19:48, Juergen Gross wrote:
> On 09/02/2019 19:45, Sander Eikelenboom wrote:
>> On 09/02/2019 09:26, Sander Eikelenboom wrote:
>>> L.S.,
>>>
>>>
>>> While testing a Linux 5.0-rc5-ish kernel (pull of yesterday) with some 
>>> additional patches for
>>> already reported other issues i came across the issue below which i haven't 
>>> seen with 4.20.x
>>>
>>> I haven't got a reproducer so i might be hard to hit it again, 
>>> system is AMD and this is from the host kernel running under
>>> the Xen hypervisor might it matter.
>>
>>> --
>>>
>>> Sander
>>
>> Hi Boris / Juergen,
>>
>> The commit causing this is:
>> 2c91bd4a4e2e530582d6fd643ea7b86b27907151 mm: speed up mremap by 20x on large 
>> regions
>>
>> Since it seems there haven't been any other reports about this .. 
>> could it be this doesn't specifically work well with a Xen PVH dom0 ?
> 
> PVH? Not PV?

Ah sorry, indeed PV!

> 
> Juergen
> 



Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-09 Thread Sander Eikelenboom
On 09/02/2019 10:59, Heiner Kallweit wrote:
> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>> L.S.,
>>>>>>>>>>
>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but 
>>>>>>>>>> they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>
>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and 
>>>>>>>>>> bisecting could be nasty due to another (networking related) kernel 
>>>>>>>>>> bug.
>>>>>>>>>>
>>>>>>>>>> If you need more info, want me to run a debug patch etc., please 
>>>>>>>>>> feel free to ask.
>>>>>>>>>>
>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver 
>>>>>>>>> between
>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause 
>>>>>>>>> could
>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>
>>>>>>>> Hmm i did some diging and i think:
>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>>>>>>> barriers
>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more 
>>>>>>>> and __netdev_sent_queue
>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add 
>>>>>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>
>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't 
>>>>>>> heard about
>>>>>>> this issue from any user of physical hw. And due to the fact that a lot 
>>>>>>> of mainboards
>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>
>>>>>> Yep, the box is already quite contented with the Xen VM's and if I 
>>>>>> remember correctly it occurred while kernel compiling
>>>>>> on the host.
>>>>>>
>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to 
>>>>>>> involve Eric Dumazet
>>>>>>> as author of the underlying changes.
>>>>>>
>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>
>>>>> The barriers were removed after adding xmit_more handling. Therefore it 
>>>>> would be good to
>>>>> test also with only 
>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>>>> barriers
>>>>> removed.
>>>>>
>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>
>>>>> Sure, thanks.
>>>>>
>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>
>>>>> Yes
>>>>>
>>>>>> If so and concluding they revert cleanly, perhaps it should be 
>>>>>> considered at this point in the RC's
>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>
>>>>> Before removing both it would be good to test with only the 
>>>>> barrier-removal removed.
>>>>>
>>>>
>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of 
>>>> xmit

Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-09 Thread Sander Eikelenboom
On 09/02/2019 10:02, Heiner Kallweit wrote:
> On 09.02.2019 00:09, Eric Dumazet wrote:
>>
>>
>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>> L.S.,
>>>>>>>>
>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but 
>>>>>>>> they don't seem related) under Xen i the nasty splat below, 
>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>
>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting 
>>>>>>>> could be nasty due to another (networking related) kernel bug.
>>>>>>>>
>>>>>>>> If you need more info, want me to run a debug patch etc., please feel 
>>>>>>>> free to ask.
>>>>>>>>
>>>>>>> Thanks for the report. However I see no change in the r8169 driver 
>>>>>>> between
>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause 
>>>>>>> could
>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>
>>>>>> Hmm i did some diging and i think:
>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>>>>> barriers
>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more 
>>>>>> and __netdev_sent_queue
>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add 
>>>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>
>>>>> You're right. Thought this was added in 4.20 already.
>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't 
>>>>> heard about
>>>>> this issue from any user of physical hw. And due to the fact that a lot 
>>>>> of mainboards
>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>> Does the issue occur under specific circumstances like very high load?
>>>>
>>>> Yep, the box is already quite contented with the Xen VM's and if I 
>>>> remember correctly it occurred while kernel compiling
>>>> on the host.
>>>>
>>>>> If indeed the xmit_more patch causes the issue, I think we have to 
>>>>> involve Eric Dumazet
>>>>> as author of the underlying changes.
>>>>
>>>> It could also be the barriers weren't that unneeded as assumed.
>>>
>>> The barriers were removed after adding xmit_more handling. Therefore it 
>>> would be good to
>>> test also with only 
>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>> barriers
>>> removed.
>>>
>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>
>>> Sure, thanks.
>>>
>>>> BTW am i correct these patches are merely optimizations ?
>>>
>>> Yes
>>>
>>>> If so and concluding they revert cleanly, perhaps it should be considered 
>>>> at this point in the RC's
>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>
>>> Before removing both it would be good to test with only the barrier-removal 
>>> removed.
>>>
>>
>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more 
>> and __netdev_sent_queue
>> looks buggy to me, since the skb might have been freed already on another 
>> cpu when you call
>>
>> You could try :
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169.c 
>> b/drivers/net/ethernet/realtek/r8169.c
>> index 
>> 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93
>>  100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff 
>> *skb,
>> dma_addr_t mapping;
>> u32 opts[2], len;
>> bool stop_queue;
>> +   bool door_bell;
>> 
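
The concern quoted above is a general ownership rule for transmit paths: once a packet has been handed to the hardware, the completion path (possibly on another CPU) may free it, so anything read from it has to be read earlier. Below is a minimal userspace sketch of the safe ordering. All names (`struct pkt`, `struct txq`, `xmit_safe`, `complete_tx`) are hypothetical; it models the idea only and is not the r8169 code or the patch quoted above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer. */
struct pkt {
	unsigned int len;
};

/* Hypothetical stand-in for a transmit queue. */
struct txq {
	struct pkt *ring_slot;     /* one-entry "descriptor ring" */
	unsigned int bytes_queued; /* bytes accounted on the TX path */
};

/*
 * Stand-in for the completion path. It may run as soon as the
 * descriptor has been published, and it frees the packet.
 */
static void complete_tx(struct txq *q)
{
	free(q->ring_slot);
	q->ring_slot = NULL;
}

/* Safe ordering: read what is needed from the packet before publishing. */
static void xmit_safe(struct txq *q, struct pkt *p)
{
	unsigned int len = p->len; /* read while we still own the packet */

	q->bytes_queued += len;    /* account before the hardware can see it */
	q->ring_slot = p;          /* publish: from here on, p may be freed */
	/* p must not be dereferenced past this point */
}
```

Reading `p->len` after the publish step instead would race with `complete_tx()` and could touch freed memory, which is the class of bug the quoted review comment points at.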

Linux 5.0 regression: BUG: unable to handle kernel paging request at ffff888023e26778

2019-02-09 Thread Sander Eikelenboom
L.S.,


While testing a Linux 5.0-rc5-ish kernel (pulled yesterday) with some
additional patches for other, already reported issues, I came across the
issue below, which I haven't seen with 4.20.x.

I haven't got a reproducer, so it might be hard to hit again. The system
is AMD and this is from the host kernel running under the Xen hypervisor,
in case that matters.

--

Sander


[17035.016433] BUG: unable to handle kernel paging request at 888023e26778
[17035.025887] #PF error: [PROT] [WRITE]
[17035.035146] PGD 2a2a067 P4D 2a2a067 PUD 2a2b067 PMD 7fe01067 PTE 
801023e26065
[17035.044371] Oops: 0003 [#1] SMP NOPTI
[17035.053720] CPU: 3 PID: 28310 Comm: apt-get Not tainted 
5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
[17035.063440] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[17035.072635] RIP: e030:move_page_tables+0x7c1/0xae0
[17035.081585] Code: ce 00 48 8b 03 31 ff 48 89 44 24 20 e8 9e 72 e4 ff 66 90 
48 89 c6 48 89 df e8 8b 89 e4 ff 66 90 48 8b 44 24 20 b9 0c 00 00 00 <48> 89 45 
00 41 f6 46 52 40 0f 85 3f 02 00 00 49 8b 7e 40 45 31 c0
[17035.100225] RSP: e02b:c9f2bd40 EFLAGS: 00010282
[17035.109208] RAX: 000475e42067 RBX: 888023e267e0 RCX: 000c
[17035.118332] RDX:  RSI:  RDI: 0201
[17035.127378] RBP: 888023e26778 R08:  R09: 00051c1d9000
[17035.136310] R10: deadbeefdeadf00d R11: 88807fc17000 R12: 7fc59fa0
[17035.145433] R13: ea8f89a8 R14: 88801c2286c0 R15: 7fc59f80
[17035.154171] FS:  7fc5a5591100() GS:88807d4c() 
knlGS:
[17035.162730] CS:  e030 DS:  ES:  CR0: 80050033
[17035.171180] CR2: 888023e26778 CR3: 1c3f6000 CR4: 0660
[17035.179545] Call Trace:
[17035.187736]  move_vma.isra.3+0xd1/0x2d0
[17035.195837]  __se_sys_mremap+0x3c6/0x5b0
[17035.203986]  do_syscall_64+0x49/0x100
[17035.212109]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[17035.219971] RIP: 0033:0x7fc5a453527a
[17035.227558] Code: 73 01 c3 48 8b 0d 1e fc 2a 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d ee fb 2a 00 f7 d8 64 89 01 48
[17035.243255] RSP: 002b:7ffda22d96f8 EFLAGS: 0246 ORIG_RAX: 
0019
[17035.251121] RAX: ffda RBX: 557d40923a30 RCX: 7fc5a453527a
[17035.258986] RDX: 01a0 RSI: 0190 RDI: 7fc59f7ff000
[17035.267127] RBP: 01a0 R08: 0020 R09: 0040
[17035.275259] R10: 0001 R11: 0246 R12: 7fc59f7ff060
[17035.282681] R13: 7fc59f7ff000 R14: 557d40923a30 R15: 557d40829aa0
[17035.290322] Modules linked in:
[17035.297875] CR2: 888023e26778
[17035.305405] ---[ end trace 6ff49f09286816b6 ]---
[17035.313131] RIP: e030:move_page_tables+0x7c1/0xae0
[17035.320326] Code: ce 00 48 8b 03 31 ff 48 89 44 24 20 e8 9e 72 e4 ff 66 90 
48 89 c6 48 89 df e8 8b 89 e4 ff 66 90 48 8b 44 24 20 b9 0c 00 00 00 <48> 89 45 
00 41 f6 46 52 40 0f 85 3f 02 00 00 49 8b 7e 40 45 31 c0
[17035.334851] RSP: e02b:c9f2bd40 EFLAGS: 00010282
[17035.341727] RAX: 000475e42067 RBX: 888023e267e0 RCX: 000c
[17035.348838] RDX:  RSI:  RDI: 0201
[17035.356000] RBP: 888023e26778 R08:  R09: 00051c1d9000
[17035.363623] R10: deadbeefdeadf00d R11: 88807fc17000 R12: 7fc59fa0
[17035.371454] R13: ea8f89a8 R14: 88801c2286c0 R15: 7fc59f80
[17035.378958] FS:  7fc5a5591100() GS:88807d4c() 
knlGS:
[17035.386585] CS:  e030 DS:  ES:  CR0: 80050033
[17035.393797] CR2: 888023e26778 CR3: 1c3f6000 CR4: 0660
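
The `Oops: 0003` and `#PF error: [PROT] [WRITE]` lines in the trace above carry the same information: on x86 the page-fault error code is a bit mask in which bit 0 means the page was present (a protection violation rather than a missing mapping) and bit 1 means the access was a write. Below is a small sketch of a decoder; the helper name `decode_pf_error` and the output strings are made up for illustration.

```c
#include <stdio.h>

/* x86 page-fault error code bits. */
#define PF_PROT  (1u << 0) /* 1: protection violation, 0: page not present */
#define PF_WRITE (1u << 1) /* 1: write access, 0: read access */
#define PF_USER  (1u << 2) /* 1: fault happened in user mode */
#define PF_RSVD  (1u << 3) /* 1: reserved bit set in a page-table entry */
#define PF_INSTR (1u << 4) /* 1: instruction fetch */

/* Hypothetical helper: render the error code as text into buf. */
static const char *decode_pf_error(unsigned int code, char *buf, size_t len)
{
	snprintf(buf, len, "%s %s %s%s%s",
		 (code & PF_PROT) ? "PROT" : "NOT-PRESENT",
		 (code & PF_WRITE) ? "WRITE" : "READ",
		 (code & PF_USER) ? "USER" : "KERNEL",
		 (code & PF_RSVD) ? " RSVD" : "",
		 (code & PF_INSTR) ? " INSTR" : "");
	return buf;
}
```

For the error code 0x3 from this report, the helper describes a kernel-mode write to a page that is mapped but not writable.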





Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 22:50, Heiner Kallweit wrote:
> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>> L.S.,
>>>>>>
>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they 
>>>>>> don't seem related) under Xen i the nasty splat below, 
>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>
>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting 
>>>>>> could be nasty due to another (networking related) kernel bug.
>>>>>>
>>>>>> If you need more info, want me to run a debug patch etc., please feel 
>>>>>> free to ask.
>>>>>>
>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause 
>>>>> could
>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>
>>>> Hmm i did some diging and i think:
>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>>>> barriers
>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and 
>>>> __netdev_sent_queue
>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add 
>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>
>>> You're right. Thought this was added in 4.20 already.
>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't 
>>> heard about
>>> this issue from any user of physical hw. And due to the fact that a lot of 
>>> mainboards
>>> have onboard Realtek network I have quite a few testers out there.
>>> Does the issue occur under specific circumstances like very high load?
>>
>> Yep, the box is already quite contented with the Xen VM's and if I remember 
>> correctly it occurred while kernel compiling
>> on the host.
>>
>>> If indeed the xmit_more patch causes the issue, I think we have to involve 
>>> Eric Dumazet
>>> as author of the underlying changes.
>>
>> It could also be the barriers weren't that unneeded as assumed.
> 
> The barriers were removed after adding xmit_more handling. Therefore it would 
> be good to
> test also with only 
> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
> barriers
> removed.

*arghh* *grmbl*

With both:
bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3
and
2e6eedb4813e34d8d84ac0eb3afb668966f3f356
reverted I get yet another splat:

[ 3769.246083] ld: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), 
nodemask=(null),cpuset=/,mems_allowed=0
[ 3769.246095] CPU: 2 PID: 3201 Comm: ld Not tainted 
5.0.0-rc5-20190208-thp-net-florian-rtl8169-doflr+ #1
[ 3769.246096] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[ 3769.246098] Call Trace:
[ 3769.246104]  
[ 3769.246114]  dump_stack+0x5c/0x7b
[ 3769.246120]  warn_alloc+0x103/0x190
[ 3769.246122]  __alloc_pages_nodemask+0xe3d/0xe80
[ 3769.246128]  ? inet_gro_receive+0x232/0x2c0
[ 3769.246130]  page_frag_alloc+0x117/0x150
[ 3769.246132]  __napi_alloc_skb+0x83/0xd0
[ 3769.246137]  rtl8169_poll+0x210/0x640
[ 3769.246140]  net_rx_action+0x23d/0x370
[ 3769.246145]  __do_softirq+0xed/0x229
[ 3769.246149]  irq_exit+0xb7/0xc0
[ 3769.246152]  xen_evtchn_do_upcall+0x27/0x40
[ 3769.246154]  xen_do_hypervisor_callback+0x29/0x40
[ 3769.246155]  
[ 3769.246161] RIP: e030:__pv_queued_spin_lock_slowpath+0xda/0x280
[ 3769.246163] Code: 14 41 bc 01 00 00 00 41 bd 00 01 00 00 3c 02 0f 94 c0 0f 
b6 c0 48 89 04 24 c6 45 14 00 ba 00 80 00 00 c6 43 01 01 eb 0b f3 90 <83> ea 01 
0f 84 49 01 00 00 0f b6 03 84 c0 75 ee 44 89 e8 f0 66 44
[ 3769.246164] RSP: e02b:c90005b0f780 EFLAGS: 0202
[ 3769.246166] RAX: 0001 RBX: 8880047c9200 RCX: 0001
[ 3769.246167] RDX: 7d75 RSI:  RDI: 8880047c9200
[ 3769.246167] RBP: 88807d4a1a80 R08: c90005b0f978 R09: c90005b0f978
[ 3769.246168] R10: c90005b0f9d0 R11: 88807fc17000 R12: 0001
[ 3769.246169] R13: 0100 R14:  R15: 000c
[ 3769.246173]  _raw_spin_lock+0x16/0x20
[ 3769.246176]  list_lru_add+0x59/0x170
[ 3769.246179]  inode_lru_list_add+0x1b/0x40
[ 3769.246182]  iput+0x18b/0x1a0
[ 3769.246184]  __dentry_kill+0xc5/0x170
[ 3769.246186]  shrink_dentry_list+0

Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 22:22, Heiner Kallweit wrote:
> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>> L.S.,
>>>>
>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they 
>>>> don't seem related) under Xen i the nasty splat below, 
>>>> that I haven encountered with Linux 4.20.x.
>>>>
>>>> Unfortunately I haven't got a clear reproducer for this and bisecting 
>>>> could be nasty due to another (networking related) kernel bug.
>>>>
>>>> If you need more info, want me to run a debug patch etc., please feel free 
>>>> to ask.
>>>>
>>> Thanks for the report. However I see no change in the r8169 driver between
>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>
>> Hmm i did some diging and i think:
>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb 
>> barriers
>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and 
>> __netdev_sent_queue
>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue 
>> as variant of __netdev_tx_sent_queue
>>
> You're right. Thought this was added in 4.20 already.
> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard 
> about
> this issue from any user of physical hw. And due to the fact that a lot of 
> mainboards
> have onboard Realtek network I have quite a few testers out there.
> Does the issue occur under specific circumstances like very high load?

Yep, the box is already quite contended by the Xen VMs and, if I remember
correctly, it occurred while compiling a kernel on the host.

> If indeed the xmit_more patch causes the issue, I think we have to involve 
> Eric Dumazet
> as author of the underlying changes.

It could also be that the barriers weren't as unneeded as assumed.
Since we are almost at RC6 I took the liberty to CC Eric now.

BTW, am I correct that these patches are merely optimizations?
If so, and seeing that they revert cleanly, perhaps it should be considered
at this point in the RCs to revert them for 5.0 and try again for 5.1?

--
Sander


> 
>> would be candidates, which were merged in 5.0.
>>
>> I have reverted the first two, see how that works out.
>>
>> --
>> Sander
>>
> Heiner
> 
>>  
>>>> --
>>>> Sander
>>>>
>>> Heiner
>>>
>>>>
>>>> [ 6466.554866] kernel BUG at lib/dynamic_queue_limits.c:27!
>>>> [ 6466.571425] invalid opcode:  [#1] SMP NOPTI
>>>> [ 6466.585890] CPU: 3 PID: 7057 Comm: as Not tainted 
>>>> 5.0.0-rc5-20190208-thp-net-florian-doflr+ #1
>>>> [ 6466.598693] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>>> V1.8B1 09/13/2010
>>>> [ 6466.611579] RIP: e030:dql_completed+0x126/0x140
>>>> [ 6466.624339] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 
>>>> 48 8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff 
>>>> <0f> 0b 8b 47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
>>>> [ 6466.648130] RSP: e02b:88807d4c3e78 EFLAGS: 00010297
>>>> [ 6466.659616] RAX: 0042 RBX: 8880049cf800 RCX: 
>>>> 
>>>> [ 6466.672835] RDX: 0001 RSI: 0042 RDI: 
>>>> 8880049cf8c0
>>>> [ 6466.684521] RBP: 888077df7260 R08: 0001 R09: 
>>>> 
>>>> [ 6466.696824] R10: 387c2336 R11: 387c2336 R12: 
>>>> 1000
>>>> [ 6466.709953] R13: 888077df6898 R14: 888077df75c0 R15: 
>>>> 00454677
>>>> [ 6466.722165] FS:  7fd869147200() GS:88807d4c() 
>>>> knlGS:
>>>> [ 6466.733228] CS:  e030 DS:  ES:  CR0: 80050033
>>>> [ 6466.746581] CR2: 7fd867dfd000 CR3: 74884000 CR4: 
>>>> 0660
>>>> [ 6466.758366] Call Trace:
>>>> [ 6466.768118]  
>>>> [ 6466.778214]  rtl8169_poll+0x4f4/0x640
>>>> [ 6466.789198]  net_rx_action+0x23d/0x370
>>>> [ 6466.798467]  __do_softirq+0xed/0x229
>>>> [ 6466.807039]  irq_exit+0xb7/0xc0
>>>> [ 6466.815471]  xen_evtchn_do_upcall+0x27/0x40
>>>> [ 6466.826647]  xen_do_hypervisor_callback+0x29/0x40
>>

Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 19:52, Heiner Kallweit wrote:
> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>> L.S.,
>>
>> While testing a linux 5.0-rc5 kernel (with some patches on top but they 
>> don't seem related) under Xen i the nasty splat below, 
>> that I haven encountered with Linux 4.20.x.
>>
>> Unfortunately I haven't got a clear reproducer for this and bisecting could 
>> be nasty due to another (networking related) kernel bug.
>>
>> If you need more info, want me to run a debug patch etc., please feel free 
>> to ask.
>>
> Thanks for the report. However I see no change in the r8169 driver between
> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
> be somewhere else. Therefore I'm afraid a bisect will be needed.

Hmm, I did some digging and I think:
bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and
__netdev_sent_queue
620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as
variant of __netdev_tx_sent_queue

would be candidates; they were merged in 5.0.

I have reverted the first two; let's see how that works out.

--
Sander

 
>> --
>> Sander
>>
> Heiner
> 
>>
>> [ 6466.554866] kernel BUG at lib/dynamic_queue_limits.c:27!
>> [ 6466.571425] invalid opcode:  [#1] SMP NOPTI
>> [ 6466.585890] CPU: 3 PID: 7057 Comm: as Not tainted 
>> 5.0.0-rc5-20190208-thp-net-florian-doflr+ #1
>> [ 6466.598693] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>> V1.8B1 09/13/2010
>> [ 6466.611579] RIP: e030:dql_completed+0x126/0x140
>> [ 6466.624339] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 
>> 48 8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 
>> 0b 8b 47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
>> [ 6466.648130] RSP: e02b:88807d4c3e78 EFLAGS: 00010297
>> [ 6466.659616] RAX: 0042 RBX: 8880049cf800 RCX: 
>> 
>> [ 6466.672835] RDX: 0001 RSI: 0042 RDI: 
>> 8880049cf8c0
>> [ 6466.684521] RBP: 888077df7260 R08: 0001 R09: 
>> 
>> [ 6466.696824] R10: 387c2336 R11: 387c2336 R12: 
>> 1000
>> [ 6466.709953] R13: 888077df6898 R14: 888077df75c0 R15: 
>> 00454677
>> [ 6466.722165] FS:  7fd869147200() GS:88807d4c() 
>> knlGS:
>> [ 6466.733228] CS:  e030 DS:  ES:  CR0: 80050033
>> [ 6466.746581] CR2: 7fd867dfd000 CR3: 74884000 CR4: 
>> 0660
>> [ 6466.758366] Call Trace:
>> [ 6466.768118]  
>> [ 6466.778214]  rtl8169_poll+0x4f4/0x640
>> [ 6466.789198]  net_rx_action+0x23d/0x370
>> [ 6466.798467]  __do_softirq+0xed/0x229
>> [ 6466.807039]  irq_exit+0xb7/0xc0
>> [ 6466.815471]  xen_evtchn_do_upcall+0x27/0x40
>> [ 6466.826647]  xen_do_hypervisor_callback+0x29/0x40
>> [ 6466.835902]  
>> [ 6466.845361] RIP: e030:xen_hypercall_mmu_update+0xa/0x20
>> [ 6466.853390] Code: 51 41 53 b8 00 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc 
>> cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 01 00 00 00 0f 05 <41> 
>> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
>> [ 6466.874031] RSP: e02b:c90003c0bdd0 EFLAGS: 0246
>> [ 6466.883452] RAX:  RBX: 00041f83bfe8 RCX: 
>> 8100102a
>> [ 6466.891986] RDX: deadbeefdeadf00d RSI: deadbeefdeadf00d RDI: 
>> deadbeefdeadf00d
>> [ 6466.903402] RBP: 0fe8 R08: 000b R09: 
>> 
>> [ 6466.911201] R10: deadbeefdeadf00d R11: 0246 R12: 
>> 80050c346067
>> [ 6466.918491] R13: 8880607c4fe8 R14: 888005082800 R15: 
>> 
>> [ 6466.926647]  ? xen_hypercall_mmu_update+0xa/0x20
>> [ 6466.938195]  ? xen_set_pte_at+0x78/0xe0
>> [ 6466.947046]  ? __handle_mm_fault+0xc43/0x1060
>> [ 6466.955772]  ? do_mmap+0x44b/0x5b0
>> [ 6466.964410]  ? handle_mm_fault+0xf8/0x200
>> [ 6466.973290]  ? __do_page_fault+0x231/0x4a0
>> [ 6466.981973]  ? page_fault+0x8/0x30
>> [ 6466.990904]  ? page_fault+0x1e/0x30
>> [ 6466.999585] Modules linked in:
>> [ 6467.007533] ---[ end trace 94bec01608fe4061 ]---
>> [ 6467.016751] RIP: e030:dql_completed+0x126/0x140
>> [ 6467.024271] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 
>> 48 8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 
>> 0b 8b 47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90

Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!

2019-02-08 Thread Sander Eikelenboom
L.S.,

While testing a Linux 5.0-rc5 kernel (with some patches on top, but they don't
seem related) under Xen, I hit the nasty splat below,
which I haven't encountered with Linux 4.20.x.

Unfortunately I haven't got a clear reproducer for this, and bisecting could be
nasty due to another (networking-related) kernel bug.

If you need more info, want me to run a debug patch etc., please feel free to 
ask.

--
Sander


[ 6466.554866] kernel BUG at lib/dynamic_queue_limits.c:27!
[ 6466.571425] invalid opcode:  [#1] SMP NOPTI
[ 6466.585890] CPU: 3 PID: 7057 Comm: as Not tainted 
5.0.0-rc5-20190208-thp-net-florian-doflr+ #1
[ 6466.598693] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[ 6466.611579] RIP: e030:dql_completed+0x126/0x140
[ 6466.624339] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 48 
8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 0b 8b 
47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
[ 6466.648130] RSP: e02b:88807d4c3e78 EFLAGS: 00010297
[ 6466.659616] RAX: 0042 RBX: 8880049cf800 RCX: 
[ 6466.672835] RDX: 0001 RSI: 0042 RDI: 8880049cf8c0
[ 6466.684521] RBP: 888077df7260 R08: 0001 R09: 
[ 6466.696824] R10: 387c2336 R11: 387c2336 R12: 1000
[ 6466.709953] R13: 888077df6898 R14: 888077df75c0 R15: 00454677
[ 6466.722165] FS:  7fd869147200() GS:88807d4c() 
knlGS:
[ 6466.733228] CS:  e030 DS:  ES:  CR0: 80050033
[ 6466.746581] CR2: 7fd867dfd000 CR3: 74884000 CR4: 0660
[ 6466.758366] Call Trace:
[ 6466.768118]  
[ 6466.778214]  rtl8169_poll+0x4f4/0x640
[ 6466.789198]  net_rx_action+0x23d/0x370
[ 6466.798467]  __do_softirq+0xed/0x229
[ 6466.807039]  irq_exit+0xb7/0xc0
[ 6466.815471]  xen_evtchn_do_upcall+0x27/0x40
[ 6466.826647]  xen_do_hypervisor_callback+0x29/0x40
[ 6466.835902]  
[ 6466.845361] RIP: e030:xen_hypercall_mmu_update+0xa/0x20
[ 6466.853390] Code: 51 41 53 b8 00 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc 
cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 01 00 00 00 0f 05 <41> 5b 59 
c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 6466.874031] RSP: e02b:c90003c0bdd0 EFLAGS: 0246
[ 6466.883452] RAX:  RBX: 00041f83bfe8 RCX: 8100102a
[ 6466.891986] RDX: deadbeefdeadf00d RSI: deadbeefdeadf00d RDI: deadbeefdeadf00d
[ 6466.903402] RBP: 0fe8 R08: 000b R09: 
[ 6466.911201] R10: deadbeefdeadf00d R11: 0246 R12: 80050c346067
[ 6466.918491] R13: 8880607c4fe8 R14: 888005082800 R15: 
[ 6466.926647]  ? xen_hypercall_mmu_update+0xa/0x20
[ 6466.938195]  ? xen_set_pte_at+0x78/0xe0
[ 6466.947046]  ? __handle_mm_fault+0xc43/0x1060
[ 6466.955772]  ? do_mmap+0x44b/0x5b0
[ 6466.964410]  ? handle_mm_fault+0xf8/0x200
[ 6466.973290]  ? __do_page_fault+0x231/0x4a0
[ 6466.981973]  ? page_fault+0x8/0x30
[ 6466.990904]  ? page_fault+0x1e/0x30
[ 6466.999585] Modules linked in:
[ 6467.007533] ---[ end trace 94bec01608fe4061 ]---
[ 6467.016751] RIP: e030:dql_completed+0x126/0x140
[ 6467.024271] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 48 
8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 0b 8b 
47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
[ 6467.039726] RSP: e02b:88807d4c3e78 EFLAGS: 00010297
[ 6467.047243] RAX: 0042 RBX: 8880049cf800 RCX: 
[ 6467.054202] RDX: 0001 RSI: 0042 RDI: 8880049cf8c0
[ 6467.062000] RBP: 888077df7260 R08: 0001 R09: 
[ 6467.069664] R10: 387c2336 R11: 387c2336 R12: 1000
[ 6467.077715] R13: 888077df6898 R14: 888077df75c0 R15: 00454677
[ 6467.084916] FS:  7fd869147200() GS:88807d4c() 
knlGS:
[ 6467.093352] CS:  e030 DS:  ES:  CR0: 80050033
[ 6467.101492] CR2: 7fd867dfd000 CR3: 74884000 CR4: 0660
[ 6467.110542] Kernel panic - not syncing: Fatal exception in interrupt
[ 6467.118166] Kernel Offset: disabled
(XEN) [2019-02-08 18:04:48.854] Hardware Dom0 crashed: rebooting machine in 5 seconds.


Re: Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 12:54, Florian Westphal wrote:
> Florian Westphal  wrote:
>> Sander Eikelenboom  wrote:
>>> L.S.,
>>>
>>> While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression 
>>> with NAT.
>>> (using an nftables firewall with NAT and connection tracking).
>>>
>>> Unfortunately it isn't too obvious since no errors are logged, but on 
>>> clients it
>>> causes symptoms like firefox intermittently not being able to load pages 
>>> with:
>>> Network Protocol Error
>>> An error occurred during a connection to www.example.com
>>> The page you are trying to view cannot be shown because an error in the 
>>> network protocol was detected.
>>> Please contact the website owners to inform them of this problem.
>>>
>>> But it's only intermittent, so I can still visit some webpages with 
>>> clients; 
>>> could it be that packet size and/or fragments are at play?
>>>
>>> So I tried testing with 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with 
>>> e8c32c32b48c2e889704d8ca0872f92eb027838e as last commit, to be sure to have 
>>> the latest netdev has to offer,
>>> but to no avail. 
>>>
>>> After that I tried to git bisect and ended up with:
>>>
>>> faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
>>> commit faec18dbb0405c7d4dda025054511dc3a6696918
>>> Author: Florian Westphal 
>>> Date:   Thu Dec 13 16:01:33 2018 +0100
>>>
>>> netfilter: nat: remove l4proto->manip_pkt
>>
>> Thanks, this is immensely helpful.
>>
>> I think I see the bug, we can't use target->dst.protonum in
>> nf_nat_l4proto_manip_pkt(), it will be TCP in case we're dealing
>> with a related icmp packet.
>>
>> I will send a patch in a few hours when I get back.
> 
> Sander, does this patch fix things for you?

Hi Florian,

You may stick on a reported/tested-by if you like.
Thanks for the swift fix !

--
Sander

> 
> Thanks!
> 
> diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> --- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> @@ -215,6 +215,7 @@ int nf_nat_icmp_reply_translation(struct sk_buff *skb,
>  
>   /* Change outer to look like the reply to an incoming packet */
>   nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> + target.dst.protonum = IPPROTO_ICMP;
>   if (!nf_nat_ipv4_manip_pkt(skb, 0, &target, manip))
>           return 0;
>  
> diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> @@ -226,6 +226,7 @@ int nf_nat_icmpv6_reply_translation(struct sk_buff *skb,
>   }
>  
>   nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> + target.dst.protonum = IPPROTO_ICMPV6;
>   if (!nf_nat_ipv6_manip_pkt(skb, 0, &target, manip))
>           return 0;
>  
> 
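
The diagnosis quoted above (the target tuple still carries TCP as its protocol when the packet being rewritten is a related ICMP error) can be illustrated with a small user-space sketch. This is not the kernel code: the struct, the enum values and every sketch_* name are assumptions made up for illustration, and the real tuple has many more fields.

```c
#include <stdint.h>

/* IANA protocol numbers; the names are local to this sketch. */
enum { SKETCH_PROTO_ICMP = 1, SKETCH_PROTO_TCP = 6 };

/* Hypothetical, heavily reduced stand-in for a conntrack tuple. */
struct sketch_tuple {
    uint8_t dst_protonum;
};

/* Stand-in for the per-protocol dispatch: reports which L4 header
 * rewrite would be selected for the given tuple. */
const char *sketch_manip_kind(const struct sketch_tuple *t)
{
    switch (t->dst_protonum) {
    case SKETCH_PROTO_TCP:
        return "tcp";
    case SKETCH_PROTO_ICMP:
        return "icmp";
    default:
        return "other";
    }
}

/* Builds the target tuple for the outer header of a related ICMP
 * error. The tuple is derived from the tracked connection, so it
 * keeps that connection's protocol unless it is overridden. */
struct sketch_tuple sketch_icmp_reply_target(const struct sketch_tuple *conn,
                                             int apply_fix)
{
    struct sketch_tuple target = *conn;

    if (apply_fix)
        target.dst_protonum = SKETCH_PROTO_ICMP;
    return target;
}
```

With apply_fix set to 0 the dispatch picks the TCP rewrite for what is really an ICMP header; with apply_fix set to 1 it picks the ICMP rewrite, which mirrors the one-line override added in each hunk of the patch.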



Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

2019-02-07 Thread Sander Eikelenboom
L.S.,

While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression 
with NAT.
(using an nftables firewall with NAT and connection tracking).

Unfortunately it isn't too obvious since no errors are logged, but on clients it
causes symptoms like firefox intermittently not being able to load pages with:
Network Protocol Error
An error occurred during a connection to www.example.com
The page you are trying to view cannot be shown because an error in the 
network protocol was detected.
Please contact the website owners to inform them of this problem.

But it's only intermittent, so I can still visit some webpages with clients; 
could it be that packet size and/or fragments are at play?

So I tried testing with 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with 
e8c32c32b48c2e889704d8ca0872f92eb027838e as the last commit, to be sure to have the 
latest netdev has to offer,
but to no avail. 

After that I tried to git bisect and ended up with:

faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
commit faec18dbb0405c7d4dda025054511dc3a6696918
Author: Florian Westphal 
Date:   Thu Dec 13 16:01:33 2018 +0100

netfilter: nat: remove l4proto->manip_pkt

This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.

nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.

Next patch will remove the empty files and the nf_nat_l4proto
struct.

nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 

:040000 040000 22d8706921e03cbd6d78a6ebcc5f253ccfd2bf0c b6f8ab2779215b4495dfe641f50e798da73859ac M  include
:040000 040000 af212a756f1acf00cbe45c3be5b71f38f01f1d34 165c440f9e6f2e05738628a19b51f7603f95752a M  net

Any ideas or debugging hints ?

--
Sander


Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 23:48, Boris Ostrovsky wrote:
> On 9/27/18 5:37 PM, Jens Axboe wrote:
>> On 9/27/18 2:33 PM, Sander Eikelenboom wrote:
>>> On 27/09/18 21:06, Boris Ostrovsky wrote:
>>>> On 9/27/18 2:56 PM, Jens Axboe wrote:
>>>>> On 9/27/18 12:52 PM, Sander Eikelenboom wrote:
>>>>>> On 27/09/18 16:26, Jens Axboe wrote:
>>>>>>> On 9/27/18 1:12 AM, Juergen Gross wrote:
>>>>>>>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>>>>>>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>>>>> added support for purging persistent grants when they are not in use. 
>>>>>>>>> As
>>>>>>>>> part of the purge, the grants were removed from the grant buffer, This
>>>>>>>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>>>>>>>> get_free_grant(). This can be observed even on an idle system, within
>>>>>>>>> 20-30 minutes.
>>>>>>>>>
>>>>>>>>> We should keep the grants in the buffer when purging, and only free 
>>>>>>>>> the
>>>>>>>>> grant ref.
>>>>>>>>>
>>>>>>>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>>>>> Signed-off-by: Boris Ostrovsky 
>>>>>>>> Reviewed-by: Juergen Gross 
>>>>>>> Since Konrad is out, I'm going to queue this up for 4.19.
>>>>>>>
>>>>>> Hi Boris/Juergen.
>>>>>>
>>>>>> Last week i tested a linux-4.19-rc4 kernel with xen-next and this patch 
>>>>>> from Boris pulled on top. 
>>>>>> Unfortunately it made a VM hang (probably because it's rootFS is 
>>>>>> shuffled from under it's feet 
>>>> What do you mean by "rootFS is shuffled from under it's feet " ?
>>> Assumption that block-front getting borked and either a kernel crash or 
>>> rootfs becoming mounted readonly. Didn't (try) to check though.
>>>
>>>>>> and it gave these in dom0 dmesg:
>>>>>>
>>>>>> [ 9251.696090] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.705861] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>> [ 9251.715781] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.725756] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>> [ 9251.735698] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.745573] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>>
>>>>>> The VM was a HVM with 4 vcpu's and 2 phy disks:
>>>>>> xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
>>>>>> persistent grants
>>>>>> xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
>>>>>> persistent grants
>>>>>>
>>>>>>
>>>>>> Currently i have been running 4.19-rc5 with xen-next on top and commit
>>>>>> a46b53672b2c reverted, for a couple of days. That seems to run stable
>>>>>> for me (since it's a small box so i'm not hit by what a46b53672b2c
>>>>>> tried to fix.
>>>>>>
>>>>>> If you can come up with a debug patch i can give that a spin tomorrow
>>>>>> evening or in the weekend, so we are hopefully still in time for the
>>>>>> 4.19 release.
>>>>> At this late in the game, might make more sense to simply revert the
>>>>> buggy commit.  Especially since what is currently out there doesn't fix
>>>>> the issue for you.
>>> Don't know if Boris or Juergen have a hunch about the issue, if not
>>> perhaps a revert is the best.
>> Anyone? Unless I hear otherwise, I'll revert the series tomorrow.
> 
> Juergen may have something to say by tomorrow, but from my perspective,
> given that we are coming up on rc6 --- yes.
> 
> I looked at the patches again and didn't see anything obvious.
> 
> -boris

It could also be that what I hit is a latent bug 
that is not caused by these patches but merely got uncovered by them.

xl dmesg also shows quite a few of these:
(XEN) [2018-09-24 03:15:46.847] grant_table.c:1755:d14v0 Expanding d14 grant table from 19 to 20 frames
(XEN) [2018-09-24 03:15:46.849] grant_table.c:1755:d14v0 Expanding d14 grant table from 20 to 21 frames
(and it has done that for ages on my box, without leading to any direct problems to my 
knowledge)

I don't know if these could be related, and whether something around the (persistent) 
grants for block devices could be leaking under some conditions?

--
Sander



Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 21:06, Boris Ostrovsky wrote:
> On 9/27/18 2:56 PM, Jens Axboe wrote:
>> On 9/27/18 12:52 PM, Sander Eikelenboom wrote:
>>> On 27/09/18 16:26, Jens Axboe wrote:
>>>> On 9/27/18 1:12 AM, Juergen Gross wrote:
>>>>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>>>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>> added support for purging persistent grants when they are not in use. As
>>>>>> part of the purge, the grants were removed from the grant buffer, This
>>>>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>>>>> get_free_grant(). This can be observed even on an idle system, within
>>>>>> 20-30 minutes.
>>>>>>
>>>>>> We should keep the grants in the buffer when purging, and only free the
>>>>>> grant ref.
>>>>>>
>>>>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>> Signed-off-by: Boris Ostrovsky 
>>>>> Reviewed-by: Juergen Gross 
>>>> Since Konrad is out, I'm going to queue this up for 4.19.
>>>>
>>> Hi Boris/Juergen.
>>>
>>> Last week i tested a linux-4.19-rc4 kernel with xen-next and this patch 
>>> from Boris pulled on top. 
>>> Unfortunately it made a VM hang (probably because it's rootFS is shuffled 
>>> from under it's feet 
> 
> What do you mean by "rootFS is shuffled from under it's feet " ?

My assumption is that block-front got borked, leading to either a kernel crash or the rootfs 
being remounted read-only. I didn't (try to) check, though.

>>> and it gave these in dom0 dmesg:
>>>
>>> [ 9251.696090] xen-blkback: requesting a grant already in use
>>> [ 9251.705861] xen-blkback: trying to add a gref that's already in the tree
>>> [ 9251.715781] xen-blkback: requesting a grant already in use
>>> [ 9251.725756] xen-blkback: trying to add a gref that's already in the tree
>>> [ 9251.735698] xen-blkback: requesting a grant already in use
>>> [ 9251.745573] xen-blkback: trying to add a gref that's already in the tree
>>>
>>> The VM was a HVM with 4 vcpu's and 2 phy disks:
>>> xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
>>> persistent grants
>>> xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
>>> persistent grants
>>>
>>>
>>> Currently i have been running 4.19-rc5 with xen-next on top and commit
>>> a46b53672b2c reverted, for a couple of days. That seems to run stable
>>> for me (since it's a small box so i'm not hit by what a46b53672b2c
>>> tried to fix.
>>>
>>> If you can come up with a debug patch i can give that a spin tomorrow
>>> evening or in the weekend, so we are hopefully still in time for the
>>> 4.19 release.
>> At this late in the game, might make more sense to simply revert the
>> buggy commit.  Especially since what is currently out there doesn't fix
>> the issue for you.
Don't know if Boris or Juergen have a hunch about the issue, if not perhaps a 
revert is the best. 

> If decision is to revert then I think the whole series needs to be
> reverted.
> 
> -boris
> 

For Boris and Juergen:
Would it make sense to have a "xen-next" branch in the xen-tip tree that:
- is based on the previous stable kernel;
- has the for-linus branches for the upcoming kernel release on top;
- has the patches for net(-next) and block changes on top (since these don't 
  go via the tree but only via mailing-list patches, which are scattered and 
  difficult to track and use for automated testing);
- and has dependency patches for the above if necessary to be able to build?

That way there is one branch that can be used to test ALL pending kernel-related Xen 
patches, and which could be used in OSStest without as
many potential false alarms as linux-next will have.

--
Sander


Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 16:26, Jens Axboe wrote:
> On 9/27/18 1:12 AM, Juergen Gross wrote:
>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>> added support for purging persistent grants when they are not in use. As
>>> part of the purge, the grants were removed from the grant buffer, This
>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>> get_free_grant(). This can be observed even on an idle system, within
>>> 20-30 minutes.
>>>
>>> We should keep the grants in the buffer when purging, and only free the
>>> grant ref.
>>>
>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>> Signed-off-by: Boris Ostrovsky 
>>
>> Reviewed-by: Juergen Gross 
> 
> Since Konrad is out, I'm going to queue this up for 4.19.
> 

Hi Boris/Juergen.

Last week I tested a linux-4.19-rc4 kernel with xen-next and this patch from 
Boris pulled on top. 
Unfortunately it made a VM hang (probably because its rootFS is shuffled from 
under its feet) 
and it gave these in dom0 dmesg:

[ 9251.696090] xen-blkback: requesting a grant already in use
[ 9251.705861] xen-blkback: trying to add a gref that's already in the tree
[ 9251.715781] xen-blkback: requesting a grant already in use
[ 9251.725756] xen-blkback: trying to add a gref that's already in the tree
[ 9251.735698] xen-blkback: requesting a grant already in use
[ 9251.745573] xen-blkback: trying to add a gref that's already in the tree

The VM was a HVM with 4 vcpu's and 2 phy disks:
xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
persistent grants
xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
persistent grants


Currently I have been running 4.19-rc5 with xen-next on top and commit 
a46b53672b2c reverted,
for a couple of days. That seems to run stable for me (since it's a small box, 
I'm not hit
by what a46b53672b2c tried to fix).

If you can come up with a debug patch I can give that a spin tomorrow evening 
or in the weekend,
so we are hopefully still in time for the 4.19 release.

--
Sander
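
The patch description quoted at the top of this message contrasts two ways of purging idle persistent grants: dropping the entry from the buffer, or keeping the entry and releasing only its grant reference. A toy user-space model of that difference might look as follows. It is only a sketch under stated assumptions: the fixed-size array, the SKETCH_INVALID_REF sentinel and all sketch_* names are hypothetical, and nothing here is taken from xen-blkfront itself.

```c
#include <stddef.h>

#define SKETCH_INVALID_REF 0u  /* hypothetical "no grant reference" marker */
#define SKETCH_POOL_SIZE   4

struct sketch_grant {
    unsigned int gref;  /* grant reference, or SKETCH_INVALID_REF */
    int in_use;         /* non-zero while a request holds the entry */
};

struct sketch_pool {
    struct sketch_grant slots[SKETCH_POOL_SIZE];
    size_t count;       /* number of entries currently in the buffer */
};

/* Purge variant that removes idle entries from the buffer entirely.
 * Returns the number of entries left. */
size_t sketch_purge_removing(struct sketch_pool *p)
{
    size_t kept = 0;

    for (size_t i = 0; i < p->count; i++)
        if (p->slots[i].in_use)
            p->slots[kept++] = p->slots[i];
    p->count = kept;
    return kept;
}

/* Purge variant that keeps every entry and only drops the reference
 * of idle ones. Returns the number of entries left (unchanged). */
size_t sketch_purge_keeping(struct sketch_pool *p)
{
    for (size_t i = 0; i < p->count; i++)
        if (!p->slots[i].in_use)
            p->slots[i].gref = SKETCH_INVALID_REF;
    return p->count;
}

/* Hands out an idle entry, or NULL when the buffer has none left.
 * The quoted description says the driver hits a BUG_ON at this point. */
struct sketch_grant *sketch_get_free_grant(struct sketch_pool *p)
{
    for (size_t i = 0; i < p->count; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            return &p->slots[i];
        }
    }
    return NULL;
}
```

On an idle pool the removing variant leaves nothing to hand out, so the next sketch_get_free_grant() call fails, while the keeping variant still returns an entry whose reference simply has to be re-established. The model says nothing about the "grant already in use" messages reported in this message.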


Re: Linux 4.16-rc1: regression bisected, Debian kernel package tool make-kpkg stalls indefinitely during kernel build due to commit "kconfig: remove check_stdin()"

2018-03-18 Thread Sander Eikelenboom
On 13/02/18 14:07, Ulf Magnusson wrote:
> On Tue, Feb 13, 2018 at 1:35 PM, Ulf Magnusson  wrote:
>> On Tue, Feb 13, 2018 at 12:33:24PM +0100, Ulf Magnusson wrote:
>>> On Tue, Feb 13, 2018 at 11:00:49AM +0100, Sander Eikelenboom wrote:
>>>> On 13/02/18 05:09, Masahiro Yamada wrote:
>>>>> 2018-02-13 12:00 GMT+09:00 Woody Suwalski :
>>>>>> Sander Eikelenboom wrote:
>>>>>>>
>>>>>>> L.S.,
>>>>>>>
>>>>>>> The Debian kernel-package tool make-kpkg for easy building of upstream
>>>>>>> kernels on Debian fails with linux 4.16-rc1.
>>>>>>>
>>>>>>> The tool (perl script) while invoked with:
>>>>>>>  make-kpkg --initrd --append_to_version -20180212 kernel_image
>>>>>>>
>>>>>>> On a git tree with a .config from the previous kernel release, so new
>>>>>>> KConfig questions have to be asked on new or changed options.
>>>>>>>
>>>>>>> The script stalls indefinitely while it seems to be executing:
>>>>>>>  exec make kpkg_version=13.018+nmu1 -f
>>>>>>> /usr/share/kernel-package/ruleset/minimal.mk debian
>>>>>>> APPEND_TO_VERSION=-t440s-20180212  INITRD=YES
>>>>>>>
>>>>>>> After using ctrl-c to break out of it, I get:
>>>>>>> ^CFailed to create a ./debian directory: No such file or directory at
>>>>>>> /usr/bin/make-kpkg line 970.
>>>>>>>
>>>>>>> Bisection turned up as culprit:
>>>>>>>  commit d2a04648a5dbc3d1d043b35257364f0197d4d868
>>>>>>>  kconfig: remove check_stdin()
>>>>>>>
>>>>>>>  Except silentoldconfig, valid_stdin is 1, so check_stdin() is no-op.
>>>>>>>
>>>>>>>  oldconfig and silentoldconfig work almost in the same way except that
>>>>>>>  the latter generates additional files under include/.  Both ask users
>>>>>>>  for input for new symbols.
>>>>>>>
>>>>>>>  I do not know why only silentoldconfig requires stdio be tty.
>>>>>>>
>>>>>>>    $ rm -f .config; touch .config
>>>>>>>    $ yes "" | make oldconfig > stdout
>>>>>>>    $ rm -f .config; touch .config
>>>>>>>    $ yes "" | make silentoldconfig > stdout
>>>>>>>    make[1]: *** [silentoldconfig] Error 1
>>>>>>>    make: *** [silentoldconfig] Error 2
>>>>>>>    $ tail -n 4 stdout
>>>>>>>    Console input/output is redirected. Run 'make oldconfig' to update configuration.
>>>>>>>
>>>>>>>    scripts/kconfig/Makefile:40: recipe for target 'silentoldconfig' failed
>>>>>>>    Makefile:507: recipe for target 'silentoldconfig' failed
>>>>>>>
>>>>>>>  Redirection is useful, for example, for testing where we want to give
>>>>>>>  particular key inputs from a test file, then check the result.
>>>>>>>
>>>>>>>  Signed-off-by: Masahiro Yamada 
>>>>>>>  Reviewed-by: Ulf Magnusson 
>>>>>>>
>>>>>>> Reverting this specific commit makes make-kpkg work again as usual.
>>>>>>>
>>>>>>> Version of the kernel-package used:
>>>>>>> ii  kernel-package
>>>>>>> 13.018+nmu1
>>>>>>>
>>>>>>>
>>>>>>> I also cc'ed the Debian developer who maintains the kernel-package
>>>>>>> package: Manoj Srivastava
>>>>>>>
>>>>>>> --
>>>>>>> Sander
>>>>>>>
>>>>>> I have noticed today the same - the kernel-build blockage was in (as I
>>>>>> recall)
>>>>>> scripts/kconfig/conf -s --silentoldconfig Kbuild
>>>>>>
>>>>>> I have bypassed it by regenerating the .config "by hand"...
>>>>>
>>>>>
>>>>> silentoldconfig asks you values for new symb

Re: Linux 4.16-rc1: regression bisected, Debian kernel package tool make-kpkg stalls indefinitely during kernel build due to commit "kconfig: remove check_stdin()"

2018-02-13 Thread Sander Eikelenboom
On 13/02/18 05:09, Masahiro Yamada wrote:
> 2018-02-13 12:00 GMT+09:00 Woody Suwalski <terraluna...@gmail.com>:
>> Sander Eikelenboom wrote:
>>>
>>> L.S.,
>>>
>>> The Debian kernel-package tool make-kpkg for easy building of upstream
>>> kernels on Debian fails with linux 4.16-rc1.
>>>
>>> The tool (perl script) while invoked with:
>>>  make-kpkg --initrd --append_to_version -20180212 kernel_image
>>>
>>> On a git tree with a .config from the previous kernel release, so new
>>> KConfig questions have to be asked on new or changed options.
>>>
>>> The script stalls indefinitely while it seems to be executing:
>>>  exec make kpkg_version=13.018+nmu1 -f
>>> /usr/share/kernel-package/ruleset/minimal.mk debian
>>> APPEND_TO_VERSION=-t440s-20180212  INITRD=YES
>>>
>>> After using Ctrl-C to break out of it, I get:
>>> ^CFailed to create a ./debian directory: No such file or directory at
>>> /usr/bin/make-kpkg line 970.
>>>
>>> Bisection turned up as culprit:
>>>  commit d2a04648a5dbc3d1d043b35257364f0197d4d868
>>>  kconfig: remove check_stdin()
>>>
>>>  Except silentoldconfig, valid_stdin is 1, so check_stdin() is no-op.
>>>
>>>  oldconfig and silentoldconfig work almost in the same way except that
>>>  the latter generates additional files under include/.  Both ask users
>>>  for input for new symbols.
>>>
>>>  I do not know why only silentoldconfig requires stdio be tty.
>>>
>>>    $ rm -f .config; touch .config
>>>    $ yes "" | make oldconfig > stdout
>>>    $ rm -f .config; touch .config
>>>    $ yes "" | make silentoldconfig > stdout
>>>    make[1]: *** [silentoldconfig] Error 1
>>>    make: *** [silentoldconfig] Error 2
>>>    $ tail -n 4 stdout
>>>    Console input/output is redirected. Run 'make oldconfig' to update configuration.
>>>
>>>    scripts/kconfig/Makefile:40: recipe for target 'silentoldconfig' failed
>>>    Makefile:507: recipe for target 'silentoldconfig' failed
>>>
>>>  Redirection is useful, for example, for testing where we want to give
>>>  particular key inputs from a test file, then check the result.
>>>
>>>  Signed-off-by: Masahiro Yamada <yamada.masah...@socionext.com>
>>>  Reviewed-by: Ulf Magnusson <ulfali...@gmail.com>
>>>
>>> Reverting this specific commit makes make-kpkg work again as usual.
>>>
>>> Version of the kernel-package used:
>>> ii  kernel-package
>>> 13.018+nmu1
>>>
>>>
>>> I also cc'ed the Debian developer who maintains the kernel-package
>>> package: Manoj Srivastava
>>>
>>> --
>>> Sander
>>>
>> I have noticed today the same - the kernel-build blockage was in (as I
>> recall)
>> scripts/kconfig/conf -s --silentoldconfig Kbuild
>>
>> I have bypassed it by regenerating the .config "by hand"...
> 
> 
> silentoldconfig asks you values for new symbols.
> So, you must answer questions to proceed.

I know, but it stalls before asking the questions.
 
> 
> How does 'make-kpkg' handle silentoldconfig?
> 
> Re-direct stdio, then make it forcibly fail?

I don't know; it is a bunch of Perl and shell scripts that get invoked, not 
the easiest to comprehend if you are not familiar with them. I'm just a user 
of the tool.

So I would have to defer that question to the Debian package maintainer; 
hopefully he will chime in.

--
Sander

> 
> 
> 



Linux 4.16-rc1: regression bisected, Debian kernel package tool make-kpkg stalls indefinitely during kernel build due to commit "kconfig: remove check_stdin()"

2018-02-12 Thread Sander Eikelenboom
L.S.,

The Debian kernel-package tool make-kpkg for easy building of upstream kernels 
on Debian fails with linux 4.16-rc1.

The tool (perl script) while invoked with:
make-kpkg --initrd --append_to_version -20180212 kernel_image

On a git tree with a .config from the previous kernel release, so new KConfig 
questions have to be asked on new or changed options.

The script stalls indefinitely while it seems to be executing:
exec make kpkg_version=13.018+nmu1 -f 
/usr/share/kernel-package/ruleset/minimal.mk debian 
APPEND_TO_VERSION=-t440s-20180212  INITRD=YES

After using Ctrl-C to break out of it, I get:
   ^CFailed to create a ./debian directory: No such file or directory at 
/usr/bin/make-kpkg line 970.
 

Bisection turned up as culprit:
commit d2a04648a5dbc3d1d043b35257364f0197d4d868
kconfig: remove check_stdin()

Except silentoldconfig, valid_stdin is 1, so check_stdin() is no-op.

oldconfig and silentoldconfig work almost in the same way except that
the latter generates additional files under include/.  Both ask users
for input for new symbols.

I do not know why only silentoldconfig requires stdio be tty.

  $ rm -f .config; touch .config
  $ yes "" | make oldconfig > stdout
  $ rm -f .config; touch .config
  $ yes "" | make silentoldconfig > stdout
  make[1]: *** [silentoldconfig] Error 1
  make: *** [silentoldconfig] Error 2
  $ tail -n 4 stdout
  Console input/output is redirected. Run 'make oldconfig' to update configuration.

  scripts/kconfig/Makefile:40: recipe for target 'silentoldconfig' failed
  Makefile:507: recipe for target 'silentoldconfig' failed

Redirection is useful, for example, for testing where we want to give
particular key inputs from a test file, then check the result.

Signed-off-by: Masahiro Yamada 
Reviewed-by: Ulf Magnusson 

Reverting this specific commit makes make-kpkg work again as usual.
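
To make the behaviour change described in the commit message easier to picture, here is a minimal Python sketch. It is an analogue only: the real kconfig code is C and is not reproduced here. The helper names `stdin_is_interactive` and `ask_new_symbol` are hypothetical. The sketch assumes the removed check amounted to "refuse to prompt when stdin is not a terminal", which is how the quoted error message reads.

```python
import io
import sys


def stdin_is_interactive(stream=None):
    """Return True when the given stream is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    try:
        return stream.isatty()
    except (AttributeError, ValueError):
        # Closed or unusual stream objects: treat as non-interactive.
        return False


def ask_new_symbol(name, stream=None, require_tty=True):
    """Hypothetical prompt helper modelling the two behaviours.

    require_tty=True  : refuse to prompt when input is redirected
                        (assumed to be what the removed check enforced).
    require_tty=False : read the answer from whatever stdin is; with a
                        pipe that never delivers data this would block.
    """
    stream = sys.stdin if stream is None else stream
    if require_tty and not stdin_is_interactive(stream):
        raise RuntimeError("input is redirected; refusing to prompt for " + name)
    answer = stream.readline()
    return answer.strip() or "default"


if __name__ == "__main__":
    piped = io.StringIO("y\n")  # stands in for a redirected stdin
    print(stdin_is_interactive(piped))
    print(ask_new_symbol("CONFIG_EXAMPLE", piped, require_tty=False))
```

If make-kpkg runs the configuration step with stdin connected to something that is neither a terminal nor a source of answers, the second mode would wait for input indefinitely. Whether that is what actually happens inside make-kpkg is an assumption, not something established in this thread.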

Version of the kernel-package used:
ii  kernel-package  13.018+nmu1 


I also cc'ed the Debian developer who maintains the kernel-package package: 
Manoj Srivastava

--
Sander



Linux 4.14-rc6 bisected regression tun devices not working anymore in openvpn

2017-10-28 Thread Sander Eikelenboom
L.S.,

While testing a Linux 4.14-rc6 kernel I noticed that OpenVPN no longer 
works. 
My openvpn config uses tun devices and is pretty standard.
The openvpn version is current Debian stable: openvpn 2.4.0-6+deb9u2

From the openvpn logging:
Sat Oct 28 16:03:34 2017 us=175829 TUN/TAP device  opened
Sat Oct 28 16:03:34 2017 us=183027 Note: Cannot set tx queue length on : No such device (errno=19)
Sat Oct 28 16:03:34 2017 us=183055 do_ifconfig, tt->did_ifconfig_ipv6_setup=0
Sat Oct 28 16:03:34 2017 us=183071 /sbin/ip link set dev  up mtu 1500
Cannot find device ""
Sat Oct 28 16:03:34 2017 us=200445 Linux ip link set failed: external program exited with error status: 1
Sat Oct 28 16:03:34 2017 us=200482 Exiting due to fatal error
Sat Oct 28 16:38:17 2017 us=923381 TCP/UDP: Closing socket
Sat Oct 28 16:38:17 2017 us=925986 Closing TUN/TAP interface


The offending commit is: 
0ad646c81b2182f7fa67ec0c8c825e0ee165696d
"tun: call dev_get_valid_name() before register_netdevice()" 

Reverting this commit fixes the issue for me; unfortunately, the commit 
itself seems to fix another issue.
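
The log above shows an empty device name after the tun device is opened. As a rough illustration of where such a name normally comes from, the Python sketch below packs a `struct ifreq` for the `TUNSETIFF` ioctl and reads the interface name back from the buffer the kernel returns. The constants are the usual values from `<linux/if_tun.h>`. The helper names (`build_ifreq`, `parse_ifname`, `open_tun`) are hypothetical, and this is not OpenVPN's code.

```python
import fcntl
import os
import struct

# Constants as defined in <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001
IFF_NO_PI = 0x1000

IFNAMSIZ = 16
# struct ifreq is 40 bytes on 64-bit Linux: a 16-byte name followed by a
# union; only the leading short (ifr_flags) of the union is used here.
IFREQ_FORMAT = "16sH22x"


def build_ifreq(name, flags):
    """Pack an interface name (may be empty) and flags into an ifreq."""
    raw = name.encode("ascii")
    if len(raw) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack(IFREQ_FORMAT, raw, flags)


def parse_ifname(ifreq):
    """Extract the NUL-terminated interface name from an ifreq buffer."""
    raw, _flags = struct.unpack(IFREQ_FORMAT, ifreq)
    return raw.split(b"\0", 1)[0].decode("ascii")


def open_tun(name=""):
    """Open a tun device and return (fd, name reported by the kernel).

    Needs CAP_NET_ADMIN and /dev/net/tun, so it is not exercised here.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        result = fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TUN | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd, parse_ifname(result)
```

With an empty requested name the kernel is expected to pick one (such as `tun0`) and write it back into the buffer. If the returned buffer still holds an empty name, any later `ip link set dev <name>` call would fail in the way the log shows.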

--
Sander


Re: ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses"): kernel BUG at arch/x86/mm/physaddr.c:79!

2017-10-26 Thread Sander Eikelenboom
On 26/10/17 19:49, Craig Bergstrom wrote:
> Sander, thanks for the details, they've been very useful.
> 
> I suspect that your host system's mem=2048M parameter is causing the
> problem.  Any chance you can confirm by removing the parameter and
> running the guest code path?

I removed it, but kept the hypervisor limiting dom0 memory to 2048M intact (in 
grub, using the Xen boot command 
"multiboot   /xen-4.10.gz  dom0_mem=2048M,max:2048M .").

Unfortunately that doesn't change anything; the guest still fails to start with 
the same errors.

> More specifically, since you're telling the kernel that its high
> memory address is at 2048M and your device is at 0xfe1fe000 (~4G), the
> new mmap() limits are preventing you from mapping addresses that are
> explicitly disallowed by the parameter.
> 

Which would probably mean the current patch prohibits hard limiting the dom0 
memory to a certain value (below 4G)
at least in combination with PCI-passthrough. So the only thing left would be 
to have no hard memory restriction on dom0
and rely on auto-ballooning, but I'm not a great fan of that.
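
The arithmetic behind Craig's explanation can be sketched as follows. This is a simplified Python model, not the kernel's code. It assumes the new check only allows mappings that end at or below the top of the memory the kernel was told about. `mmap_range_allowed` is a hypothetical name, and the BAR size is the one reported in the qemu log elsewhere in this thread.

```python
def mib(n):
    """Convert mebibytes to bytes."""
    return n << 20


def mmap_range_allowed(phys_addr, length, top_of_ram):
    """Simplified model: the mapping must end at or below top of RAM."""
    return phys_addr + length <= top_of_ram


DOM0_LIMIT = mib(2048)   # 0x80000000, from mem=2048M / dom0_mem=2048M
DEVICE_BAR = 0xFE1FE000  # device address mentioned in the thread
BAR_SIZE = 0x2000        # region size from the qemu log

if __name__ == "__main__":
    print(hex(DOM0_LIMIT))
    print(mmap_range_allowed(DEVICE_BAR, BAR_SIZE, DOM0_LIMIT))
```

Under this model the device BAR lies far above the 2048M limit, so any mapping of it is rejected regardless of its length.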

I don't know how KVM handles setting memory limits for the host system, but 
perhaps it suffers from the same issue.

I also tried the patch from one of your last mails to make the check "less 
strict", but still get the same errors (when using the hard memory limits).

--
Sander

 
> 
> On Thu, Oct 26, 2017 at 10:39 AM, Ingo Molnar  wrote:
>>
>> * Craig Bergstrom  wrote:
>>
>>> Yes, not much time left for 4.14, it might be reasonable to pull the
>>> change out since it's causing problems. [...]
>>
>> Ok, I'll queue up a revert tomorrow morning and send it to Linus ASAP if
>> there's no good fix by then. In hindsight I should have queued it for v4.15 ...
>>
>> Thanks,
>>
>> Ingo



Re: ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses"): kernel BUG at arch/x86/mm/physaddr.c:79!

2017-10-26 Thread Sander Eikelenboom
On 26/10/17 10:12, Sander Eikelenboom wrote:
> On 26/10/17 10:05, Sander Eikelenboom wrote:
>> On 26/10/17 00:02, Craig Bergstrom wrote:
>>> Thanks for the notification, my apologies for the breakage.  I'll take a
>>> close look and see if I can figure out what went wrong.
>>>
>>> Sander, any chance you can send /proc/iomem and the inputs to the mmap call
>>> that fail on your affected system?
>>
>> Hi Craig,
>>
>> The output from /proc/iomem is simple to get and attached.
>> The mmap call is probably issued by qemu and will require more digging.
> 
> Ahh grepping qemu gave a pointer, it's probably the code in:
> 
> http://xenbits.xen.org/gitweb/?p=qemu-xen.git;a=blob;f=hw/xen/xen_pt_msi.c;h=ff9a79f5d27ad7d74a1b22297be560feb455063c;hb=5cd7ce5dde3f228b3b669ed9ca432f588947bd40
> 
> around line 571, that would also explain why it's only this device that
> has the problem, since it's the only one trying to use MSI(-X)
> interrupts. Will see if I can add some logging to that function.

Attached is the qemu debug output with an extra line printing all the values
used to calculate the arguments of the mmap call.
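
For readers who want to redo the calculation from the values in the extra debug line, here is a small Python sketch. The formula is an assumption about how the linked xen_pt_msi.c derives the mmap length and offset: the table offset is split into a page-aligned part and a sub-page adjustment. The function name `msix_mmap_args` is hypothetical.

```python
PCI_MSIX_ENTRY_SIZE = 0x10  # value printed in the debug output
SUB_PAGE_MASK = 0x0FFF      # low 12 bits, i.e. offset within a 4 KiB page


def msix_mmap_args(table_base, table_off, total_entries):
    """Return (length, offset, adjust) for mapping the MSI-X table.

    adjust : sub-page part of the table offset
    length : table size plus the adjustment
    offset : page-aligned physical address passed to mmap()
    """
    adjust = table_off & SUB_PAGE_MASK
    length = total_entries * PCI_MSIX_ENTRY_SIZE + adjust
    offset = table_base + table_off - adjust
    return length, offset, adjust


if __name__ == "__main__":
    length, offset, adjust = msix_mmap_args(0xFE1FE000, 0x1000, 8)
    print(hex(length), hex(offset), hex(adjust))
```

With the logged values (table_base 0xfe1fe000, table_off 0x1000, 8 entries) the adjustment comes out as 0, which matches the `table_offset_adjust = 0` in the debug line.
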
--
Sander

 
> --
> Sander
> 
> 
>>
>> I don't know if there is that much time left for 4.14, since we are at
>> RC6 already.
>>
>> --
>> Sander
>>
>>
>>>
>>>
>>> On Wed, Oct 25, 2017 at 2:50 PM, Boris Ostrovsky <boris.ostrov...@oracle.com>
>>> wrote:
>>>
>>>> On 10/23/2017 10:44 PM, Fengguang Wu wrote:
>>>>> Greetings,
>>>>>
>>>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>>>
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> commit ce56a86e2ade45d052b3228cdfebe913a1ae7381
>>>>> Author: Craig Bergstrom <cra...@google.com>
>>>>> AuthorDate: Thu Oct 19 13:28:56 2017 -0600
>>>>> Commit: Ingo Molnar <mi...@kernel.org>
>>>>> CommitDate: Fri Oct 20 09:48:00 2017 +0200
>>>>>
>>>>>  x86/mm: Limit mmap() of /dev/mem to valid physical addresses
>>>>
>>>> Also note
>>>> https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02935.html
>>>>
>>>> -boris
>>>>
>>>
>>
> 

qemu-system-i386: -serial pty: char device redirected to /dev/pts/16 (label serial0)
[00:05.0] xen_pt_realize: Assigning real physical device 08:00.0 to devfn 0x28
[00:05.0] xen_pt_register_regions: IO region 0 registered (size=0x2000 base_addr=0xfe1fe000 type: 0x4)
[00:05.0] xen_pt_config_reg_init: Offset 0x000e mismatch! Emulated=0x0080, host=0x, syncing to 0x0080.
[00:05.0] xen_pt_config_reg_init: Offset 0x0010 mismatch! Emulated=0x, host=0xfe1fe004, syncing to 0xfe1fe004.
[00:05.0] xen_pt_config_reg_init: Offset 0x0052 mismatch! Emulated=0x, host=0x4803, syncing to 0x0003.
[00:05.0] xen_pt_config_reg_init: Offset 0x0072 mismatch! Emulated=0x, host=0x0086, syncing to 0x0080.
[00:05.0] xen_pt_config_reg_init: Offset 0x00a4 mismatch! Emulated=0x, host=0x8fc0, syncing to 0x8fc0.
[00:05.0] xen_pt_config_reg_init: Offset 0x00b2 mismatch! Emulated=0x, host=0x1012, syncing to 0x1012.
[00:05.0] xen_pt_msix_init: get MSI-X table BAR base 0xfe1fe000
[00:05.0] xen_pt_msix_init: table_off = 0x1000, total_entries = 8
[00:05.0] xen_pt_msix_init: table_off = 0x1000, total_entries = 8, PCI_MSIX_ENTRY_SIZE = 0x10,  msix->table_offset_adjust = 0,  msix->table_base = 0xfe1fe000
[00:05.0] xen_pt_msix_init: Error: Can't map physical MSI-X table: Invalid argument
[00:05.0] xen_pt_msix_size_init: Error: Internal error: Invalid xen_pt_msix_init.
Failed to initialize 12/15, type = 0x1, rc: -22
[00:05.0] xen_pt_msi_set_enable: disabling MSI.
*** Error in `/usr/local/lib/xen/bin/qemu-system-i386': corrupted size vs. prev_size: 0x55ce13565570 ***
=== Backtrace: =
/lib/x86_64-linux-gnu/libc.so.6(+0x70bcb)[0x7f700ab7ebcb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76f96)[0x7f700ab84f96]
/lib/x86_64-linux-gnu/libc.so.6(+0x77388)[0x7f700ab85388]
/lib/x86_64-linux-gnu/libc.so.6(+0x78dca)[0x7f700ab86dca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_calloc+0x27b)[0x7f700ab89b4b]
/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_malloc0+0x21)[0x7f700bbbee61]
/usr/local/lib/xen/bin/qemu-system-i386(+0x6d78ee)[0x55ce114298ee]
/usr/local/lib/xen/bin/qemu-system-i386(+0x6d309e)[0x55ce1142509e]
/usr/local/lib/xen/bin/qemu-system-i386(+0x6d316f)[0x55ce1142516f]
/usr/local/lib/xen/bin/qemu-system-i386(+0x24d79b)[0x55ce10f9f79b]
/usr/local/lib/xen/bin/qemu-system-i386(+0x6da8bf)[0x55ce1142c8bf]
/usr/local/lib/xen/bin/qemu-

Re: ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses"): kernel BUG at arch/x86/mm/physaddr.c:79!

2017-10-26 Thread Sander Eikelenboom
On 26/10/17 00:02, Craig Bergstrom wrote:
> Thanks for the notification, my apologies for the breakage.  I'll take a
> close look and see if I can figure out what went wrong.
> 
> Sander, any chance you can send /proc/iomem and the inputs to the mmap call
> that fail on your affected system?

Hi Craig,

The output from /proc/iomem is simple to get and attached.
The mmap call is probably issued by qemu and will require more digging.
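
For anyone wanting to check which /proc/iomem regions cover a given physical address, a minimal Python sketch follows. The sample text is illustrative only, written in the /proc/iomem format rather than copied from the attachment. `parse_iomem` and `regions_containing` are hypothetical helper names.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name".
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Return a list of (depth, start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue
        indent, start, end, name = match.groups()
        # Nesting is shown with two spaces per level.
        regions.append((len(indent) // 2, int(start, 16), int(end, 16), name.strip()))
    return regions


def regions_containing(regions, addr):
    """Names of all regions covering addr, outermost first."""
    hits = [r for r in regions if r[1] <= addr <= r[2]]
    return [name for _depth, _start, _end, name in sorted(hits, key=lambda r: r[0])]


# Illustrative sample, NOT taken from the attached output.
SAMPLE = """\
00100000-7fffffff : System RAM
f0000000-febfffff : PCI Bus 0000:00
  fe100000-fe1fffff : PCI Bus 0000:08
    fe1fe000-fe1fffff : 0000:08:00.0
"""

if __name__ == "__main__":
    print(regions_containing(parse_iomem(SAMPLE), 0xFE1FE000))
```

On a real system the text would come from reading /proc/iomem, typically as root.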

I don't know if there is that much time left for 4.14, since we are at
RC6 already.

--
Sander


> 
> 
> On Wed, Oct 25, 2017 at 2:50 PM, Boris Ostrovsky wrote:
> 
>> On 10/23/2017 10:44 PM, Fengguang Wu wrote:
>>> Greetings,
>>>
>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>
>>> commit ce56a86e2ade45d052b3228cdfebe913a1ae7381
>>> Author: Craig Bergstrom 
>>> AuthorDate: Thu Oct 19 13:28:56 2017 -0600
>>> Commit: Ingo Molnar 
>>> CommitDate: Fri Oct 20 09:48:00 2017 +0200
>>>
>>>  x86/mm: Limit mmap() of /dev/mem to valid physical addresses
>>
>> Also note
>> https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02935.html
>>
>> -boris
>>
> 

-0fff : Reserved
1000-00095fff : System RAM
00096000-000963ff : RAM buffer
00096400-000f : Reserved
  000a-000b : PCI Bus :00
  000c-000cfdff : Video ROM
  000d-000d : PCI Bus :00
000d4800-000d4bff : Adapter ROM
  000f-000f : System ROM
0010-7fff : System RAM
  0100-01d2a703 : Kernel code
  01d2a704-025450ff : Kernel data
  02b3f000-02cc1fff : Kernel bss
c7f9-c7f9dfff : ACPI Tables
c7f9e000-c7fd : ACPI Non-volatile Storage
c7fe-c7ff : Reserved
c800-dfff : PCI Bus :00
  cfe0-cfef : PCI Bus :0c
cfef8000-cfefbfff : :0c:00.0
  cfef8000-cfefbfff : r8169
cfeff000-cfef : :0c:00.0
  cfeff000-cfef : r8169
  cff0-cfff : PCI Bus :0d
cfff8000-cfffbfff : :0d:00.0
  cfff8000-cfffbfff : r8169
c000-cfff : :0d:00.0
  c000-cfff : r8169
  d000-dfff : PCI Bus :0f
d000-dfff : :0f:00.0
  d000-d0ff : vesafb
e000-efff : PCI MMCONFIG  [bus 00-ff]
  e000-efff : pnp 00:07
f000-febf : PCI Bus :00
  f600-f6003fff : Reserved
f600-f6003fff : pnp 00:01
  fdcf7000-fdcf7fff : :00:12.0
fdcf7000-fdcf7fff : ohci_hcd
  fdcf8000-fdcfbfff : :00:14.2
  fdcfc000-fdcfcfff : :00:13.0
fdcfc000-fdcfcfff : ohci_hcd
  fdcfd000-fdcfdfff : :00:14.5
fdcfd000-fdcfdfff : ohci_hcd
  fdcfe000-fdcfefff : :00:16.0
fdcfe000-fdcfefff : ohci_hcd
  fdcff000-fdcff3ff : :00:11.0
fdcff000-fdcff3ff : ahci
  fdcff400-fdcff4ff : :00:12.2
fdcff400-fdcff4ff : ehci_hcd
  fdcff800-fdcff8ff : :00:13.2
fdcff800-fdcff8ff : ehci_hcd
  fdcffc00-fdcffcff : :00:16.2
fdcffc00-fdcffcff : ehci_hcd
  fde0-fdef : PCI Bus :04
fdef8000-fdef8fff : :04:00.0
fdef9000-fdef9fff : :04:00.1
fdefa000-fdefafff : :04:00.2
fdefb000-fdefbfff : :04:00.3
fdefc000-fdefcfff : :04:00.4
fdefd000-fdefdfff : :04:00.5
fdefe000-fdefefff : :04:00.6
fdeff000-fdef : :04:00.7
  fdf0-fe1f : PCI Bus :05
fdfe-fdff : :05:00.0
fe00-fe1f : PCI Bus :06
  fe00-fe0f : PCI Bus :07
fe0e-fe0e : :07:00.0
fe0ff800-fe0f : :07:00.0
  fe0ff800-fe0f : ahci
  fe10-fe1f : PCI Bus :08
fe1fe000-fe1f : :08:00.0
  fe20-fe3f : PCI Bus :09
fe20-fe3f : :09:00.0
  fe40-fe4f : PCI Bus :0a
fe4f8000-fe4f8fff : :0a:00.0
fe4f9000-fe4f9fff : :0a:00.1
fe4fa000-fe4fafff : :0a:00.2
fe4fb000-fe4fbfff : :0a:00.3
fe4fc000-fe4fcfff : :0a:00.4
fe4fd000-fe4fdfff : :0a:00.5
fe4fe000-fe4fefff : :0a:00.6
fe4ff000-fe4f : :0a:00.7
  fe50-fe5f : PCI Bus :0b
fe5fe000-fe5f : :0b:00.0
  fe60-fe6f : PCI Bus :0c
fe6e-fe6f : :0c:00.0
  fe70-fe7f : PCI Bus :0d
fe7e-fe7f : :0d:00.0
  fe80-fe8f : PCI Bus :0e
fe8fe000-fe8f : :0e:00.0
  fe90-fe9f : PCI Bus :0f
fe9e-fe9e : :0f:00.0
fe9fc000-fe9f : :0f:00.1
  fe9fc000-fe9f : ICH HD audio
fec0-fec00fff : Reserved
  fec0-fec003ff : IOAPIC 0
fec1-fec1001f : pnp 00:06
fec2-fec20fff : Reserved
  fec2-fec203ff : IOAPIC 1
fed0-fed003ff : HPET 2
  fed0-fed003ff : PNP0103:00
fed8-fed80fff : pnp 00:06
fee0-feef : Reserved
  fee0-fee00fff : Local APIC
fee0-fee00fff : pnp 00:05
ffb8-ffbf : pnp 

Re: ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses"): kernel BUG at arch/x86/mm/physaddr.c:79!

2017-10-26 Thread Sander Eikelenboom
On 26/10/17 10:05, Sander Eikelenboom wrote:
> On 26/10/17 00:02, Craig Bergstrom wrote:
>> Thanks for the notification, my apologies for the breakage.  I'll take a
>> close look and see if I can figure out what went wrong.
>>
>> Sander, any chance you can send /proc/iomem and the inputs to the mmap call
>> that fail on your affected system?
> 
> Hi Craig,
> 
> The output from /proc/iomem is simple to get and attached.
> The mmap call is probably issued by qemu and will require more digging.

Ahh grepping qemu gave a pointer, it's probably the code in:

http://xenbits.xen.org/gitweb/?p=qemu-xen.git;a=blob;f=hw/xen/xen_pt_msi.c;h=ff9a79f5d27ad7d74a1b22297be560feb455063c;hb=5cd7ce5dde3f228b3b669ed9ca432f588947bd40

around line 571, that would also explain why it's only this device that
has the problem, since it's the only one trying to use MSI(-X)
interrupts. Will see if I can add some logging to that function.
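
For readers following the thread, here is a minimal sketch of the arithmetic involved when a userspace program maps a device region through /dev/mem. It assumes (this is not confirmed from the qemu source) that the caller page-aligns a physical address and passes the aligned base as the mmap() offset. The helper names, the 4 KiB page size, and the `phys_bits` parameter are hypothetical, and the range check only models the idea in the commit title, not the kernel's actual implementation.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ULL  /* assumption: 4 KiB pages */

/* Split a physical address into a page-aligned base (usable as an
 * mmap() offset on /dev/mem) and the remaining offset inside that page. */
static void split_phys_addr(uint64_t phys, uint64_t *base, uint64_t *in_page)
{
    *base = phys & ~(PAGE_SIZE_ASSUMED - 1);
    *in_page = phys & (PAGE_SIZE_ASSUMED - 1);
}

/* Hypothetical model of "limit mmap() to valid physical addresses":
 * accept a range only if it ends at or below 2^phys_bits.
 * phys_bits must be smaller than 64. */
static int phys_range_is_valid(uint64_t base, uint64_t len, unsigned phys_bits)
{
    uint64_t limit = 1ULL << phys_bits;

    /* Written this way so that base + len cannot overflow. */
    return len <= limit && base <= limit - len;
}
```

A caller would pass the `base` value as the offset argument of mmap() and add `in_page` to the returned pointer. Logging those two values before the call is one way to collect the mmap inputs requested earlier in the thread.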

--
Sander


> 
> I don't know if there is that much time left for 4.14, since we are at
> RC6 already.
> 
> --
> Sander
> 
> 
>>
>>
>> On Wed, Oct 25, 2017 at 2:50 PM, Boris Ostrovsky <boris.ostrov...@oracle.com> wrote:
>>
>>> On 10/23/2017 10:44 PM, Fengguang Wu wrote:
>>>> Greetings,
>>>>
>>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>>
>>>> commit ce56a86e2ade45d052b3228cdfebe913a1ae7381
>>>> Author: Craig Bergstrom <cra...@google.com>
>>>> AuthorDate: Thu Oct 19 13:28:56 2017 -0600
>>>> Commit: Ingo Molnar <mi...@kernel.org>
>>>> CommitDate: Fri Oct 20 09:48:00 2017 +0200
>>>>
>>>>  x86/mm: Limit mmap() of /dev/mem to valid physical addresses
>>>
>>> Also note
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02935.html
>>>
>>> -boris
>>>
>>
> 



4.12-RC2 BUG: scheduling while atomic: irq/47-iwlwifi

2017-05-22 Thread Sander Eikelenboom
Hi,

I encountered this splat with 4.12-RC2.
--

Sander

[  119.021594] BUG: scheduling while atomic: irq/47-iwlwifi/517/0x0200
[  119.021604] Modules linked in: xt_tcpudp ip6t_rpfilter ipt_REJECT 
nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc 
ip6table_raw ip6table_security ip6table_mangle iptable_raw iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_security iptable_mangle ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables rfcomm bnep binfmt_misc arc4 
iTCO_wdt iTCO_vendor_support uvcvideo videobuf2_vmalloc videobuf2_memops 
videobuf2_v4l2 videobuf2_core videodev intel_rapl cdc_mbim iwlmvm 
x86_pkg_temp_thermal intel_powerclamp mac80211 media cdc_wdm btusb coretemp 
cdc_ncm kvm_intel usbnet mii cdc_acm iwlwifi kvm btintel joydev pcspkr 
serio_raw cfg80211 snd_hda_codec_hdmi
[  119.021701]  bluetooth lpc_ich snd_hda_codec_realtek snd_hda_codec_generic 
shpchp sg ecdh_generic snd_hda_intel thinkpad_acpi snd_hda_codec snd_hwdep 
snd_hda_core snd_pcm snd_timer nvram snd soundcore evdev tpm_tis tpm_tis_core 
tpm algif_skcipher af_alg crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel rtsx_pci_sdmmc mmc_core aesni_intel aes_x86_64 crypto_simd 
cryptd glue_helper psmouse i2c_i801 sd_mod ehci_pci ehci_hcd e1000e rtsx_pci 
mfd_core ptp xhci_pci pps_core xhci_hcd
[  119.021759] CPU: 1 PID: 517 Comm: irq/47-iwlwifi Not tainted 4.12.0-rc2-t440s-20170522+ #1
[  119.021763] Hardware name: LENOVO 20AQS03H00/20AQS03H00, BIOS GJET91WW (2.41 ) 09/21/2016
[  119.021766] Call Trace:
[  119.021778]  ? dump_stack+0x5c/0x84
[  119.021784]  ? __schedule_bug+0x4c/0x70
[  119.021792]  ? __schedule+0x496/0x5c0
[  119.021798]  ? schedule+0x2d/0x80
[  119.021804]  ? schedule_preempt_disabled+0x5/0x10
[  119.021810]  ? __mutex_lock.isra.0+0x18e/0x4c0
[  119.021817]  ? __wake_up+0x2f/0x50
[  119.021833]  ? cfg80211_sched_scan_results+0x19/0x60 [cfg80211]
[  119.021844]  ? cfg80211_sched_scan_results+0x19/0x60 [cfg80211]
[  119.021859]  ? iwl_mvm_rx_lmac_scan_iter_complete_notif+0x17/0x30 [iwlmvm]
[  119.021869]  ? iwl_pcie_rx_handle+0x2a9/0x7e0 [iwlwifi]
[  119.021878]  ? iwl_pcie_irq_handler+0x17c/0x730 [iwlwifi]
[  119.021884]  ? irq_forced_thread_fn+0x60/0x60
[  119.021887]  ? irq_thread_fn+0x16/0x40
[  119.021892]  ? irq_thread+0x109/0x180
[  119.021896]  ? wake_threads_waitq+0x30/0x30
[  119.021901]  ? kthread+0xf2/0x130
[  119.021905]  ? irq_thread_dtor+0x90/0x90
[  119.021910]  ? kthread_create_on_node+0x40/0x40
[  119.021915]  ? ret_from_fork+0x26/0x40


Re: [PATCH] xen/x86: Initialize per_cpu(xen_vcpu, 0) a little earlier

2016-10-03 Thread Sander Eikelenboom

On 2016-10-03 00:45, Boris Ostrovsky wrote:
> xen_cpuhp_setup() calls mutex_lock() which, when CONFIG_DEBUG_MUTEXES
> is defined, ends up calling xen_save_fl(). That routine expects
> per_cpu(xen_vcpu, 0) to be already initialized.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
> Reported-by: Sander Eikelenboom <li...@eikelenboom.it>
> ---
> Sander, please see if this fixes the problem. Thanks.

Hi Boris,

I have tested it and it fixes the dom0 crash in early boot for me.
Thanks again for investigating and the swift fix !

--
Sander

>  arch/x86/xen/enlighten.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 366b6ae..96c2dea 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1644,7 +1644,6 @@ asmlinkage __visible void __init xen_start_kernel(void)
>     xen_initial_gdt = &per_cpu(gdt_page, 0);
>
>     xen_smp_init();
> -   WARN_ON(xen_cpuhp_setup());
>
>  #ifdef CONFIG_ACPI_NUMA
>     /*
> @@ -1658,6 +1657,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
>        possible map and a non-dummy shared_info. */
>     per_cpu(xen_vcpu, 0) = &HYPERVISOR_shared_info->vcpu_info[0];
>
> +   WARN_ON(xen_cpuhp_setup());
> +
>     local_irq_disable();
>     early_boot_irqs_disabled = true;
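
The ordering problem described in the commit message can be modelled in plain userspace C. This is only a sketch, not kernel code: `vcpu_slot`, `save_flags_model()`, `cpuhp_setup_model()` and the two `boot_*` functions are hypothetical stand-ins for per_cpu(xen_vcpu, 0), xen_save_fl(), xen_cpuhp_setup() and the call order before and after the patch.

```c
#include <stddef.h>

/* Simplified stand-in for the per-vCPU info structure. */
struct vcpu_info_model {
    int evtchn_upcall_mask;
};

static struct vcpu_info_model shared_info_vcpu0 = { 1 };

/* Stand-in for per_cpu(xen_vcpu, 0); NULL until initialised. */
static struct vcpu_info_model *vcpu_slot;

/* Stand-in for xen_save_fl(): it needs vcpu_slot to be set.
 * The model returns -1 instead of crashing when it is still NULL. */
static int save_flags_model(void)
{
    if (vcpu_slot == NULL)
        return -1;
    return vcpu_slot->evtchn_upcall_mask;
}

/* Stand-in for xen_cpuhp_setup(): taking the lock ends up reading
 * the flags when mutex debugging is enabled. */
static int cpuhp_setup_model(void)
{
    return save_flags_model();
}

/* Old order: the setup call runs before the pointer is initialised. */
static int boot_old_order(void)
{
    int rc;

    vcpu_slot = NULL;
    rc = cpuhp_setup_model();
    vcpu_slot = &shared_info_vcpu0;
    return rc;
}

/* New order: initialise the pointer first, then run the setup call. */
static int boot_new_order(void)
{
    vcpu_slot = &shared_info_vcpu0;
    return cpuhp_setup_model();
}
```

In this model `boot_old_order()` reports the failure and `boot_new_order()` succeeds, which mirrors why the patch moves the WARN_ON(xen_cpuhp_setup()) line below the per_cpu(xen_vcpu, 0) assignment.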


Re: [Intel-gfx] Linux 4.8-rc?: WARNING: at drivers/gpu/drm/i915/intel_pm.c:7866 sandybridge_pcode_write Missing switch case (16) in gen6_check_mailbox_status

2016-09-07 Thread Sander Eikelenboom

On 2016-09-07 16:49, Jani Nikula wrote:
> On Tue, 06 Sep 2016, li...@eikelenboom.it wrote:
>> On 2016-09-06 11:25, Jani Nikula wrote:
>>> On Tue, 06 Sep 2016, li...@eikelenboom.it wrote:
>>>> L.S.,
>>>>
>>>> Since one of the last 4.8 RC's i'm getting the warning below when
>>>> booting on my sandybridge based thinkpad.
>>>> From what it seems the machine still works fine though.
>>>
>>> What does 'lspci -nns 2' say for you?
>>
>> 00:02.0 VGA compatible controller [0300]: Intel Corporation 2nd
>> Generation Core Processor Family Integrated Graphics Controller
>> [8086:0126] (rev 09)
>
> Fixed in drm-intel-fixes by
>
> commit fc2780b66b15092ac68272644a522c1624c48547
> Author: Chris Wilson
> Date:   Fri Aug 26 11:59:26 2016 +0100
>
>     drm/i915: Add GEN7_PCODE_MIN_FREQ_TABLE_GT_RATIO_OUT_OF_RANGE to SNB
>
> BR,
> Jani.

Works-for-me, thx!

--
Sander


Re: [Linux 4.8-rc1 Bisected] Clock on boot Xen HVM guest starts at 31/12/1999

2016-08-12 Thread Sander Eikelenboom

Friday, August 12, 2016, 7:29:37 PM, you wrote:

> Hi,

> On 12/08/2016 at 19:23:36 +0200, Sander Eikelenboom wrote :
>> L.S.,
>> 
>> I'm seeing an issue when using a Linux 4.8-rc1 kernel in a Xen HVM guest (PV
>> guests and dom0 are unaffected). The clock is always set to 31/12/1999 on boot
>> of the guest, instead of the system clock time.
>> 
>> Bisecting seems to point out commit:
>> 463a86304cae92e10277b47180ac59cf93982e5b char/genrtc: x86: remove remnants 
>> of asm/rtc.h
>> 

> Isn't that solved by http://patchwork.ozlabs.org/patch/657465/ ?


Ah yes, that solves it (I only looked in your git tree to see if there was a
patch already). Sorry for the noise!

--

Sander



[Linux 4.8-rc1 Bisected] Clock on boot Xen HVM guest starts at 31/12/1999

2016-08-12 Thread Sander Eikelenboom
L.S.,

I'm seeing an issue when using a Linux 4.8-rc1 kernel in a Xen HVM guest (PV 
guests and dom0 are unaffected). The clock is always set to 31/12/1999 on boot 
of the guest, instead of the system clock time.

Bisecting seems to point out commit:
463a86304cae92e10277b47180ac59cf93982e5b char/genrtc: x86: remove remnants of 
asm/rtc.h

--
Sander



Re: nf_unregister_net_hook: hook not found!

2015-12-30 Thread Sander Eikelenboom

On 2015-12-30 03:39, ebied...@xmission.com wrote:
> Pablo Neira Ayuso writes:
>
>> On Mon, Dec 28, 2015 at 09:05:03PM +0100, Sander Eikelenboom wrote:
>>> Hi,
>>>
>>> Running a 4.4.0-rc6 kernel i encountered the warning below.
>>
>> Cc'ing Eric Biederman.
>>
>> @Sander, could you provide a way to reproduce this?
>
> I am on vacation until the new year, but if this is reproducible we
> should be able to print out reg, reg->pf, reg->hooknum, reg->hook
> to figure out which hook is having something very weird happen to it.
>
> This is happening in some network namespace exit.
>
> Eric

Unfortunately I have found no way to reproduce it.
A timestamp of 13 seconds implies it happened at boot, but I have only seen this once.

--
Sander

>> Thanks.
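
Eric's suggestion to print reg, reg->pf, reg->hooknum and reg->hook could look roughly like the following userspace sketch. It is a model only: `struct hook_ops_model`, `find_hook()` and `describe_missing_hook()` are hypothetical names, and the real netfilter structures and unregister path are not reproduced here.

```c
#include <stddef.h>
#include <stdio.h>

typedef unsigned int (*hook_fn_model)(void);

/* Simplified stand-in for the fields named in the thread. */
struct hook_ops_model {
    hook_fn_model hook;
    unsigned int pf;
    unsigned int hooknum;
};

/* Return the index of reg in table, or -1 if no entry matches. */
static int find_hook(const struct hook_ops_model *table, size_t n,
                     const struct hook_ops_model *reg)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].hook == reg->hook &&
            table[i].pf == reg->pf &&
            table[i].hooknum == reg->hooknum)
            return (int)i;
    }
    return -1;
}

/* Format the identifying fields of a hook that could not be found.
 * Returns the value of snprintf(). */
static int describe_missing_hook(char *buf, size_t len,
                                 const struct hook_ops_model *reg)
{
    return snprintf(buf, len, "hook not found: pf=%u hooknum=%u",
                    reg->pf, reg->hooknum);
}
```

The hook pointer itself is left out of the message because printing a function pointer portably is awkward in userspace C. In the kernel, a printk pointer specifier such as %pS could be used to show the symbol name, which would identify the hook directly.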


[   13.740472] ip_tables: (C) 2000-2006 Netfilter Core Team
[   13.936237] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   13.945391] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   13.947434] iwlwifi :03:00.0: Radio type=0x2-0x1-0x0
[   14.223990] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   14.232065] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   14.233570] iwlwifi :03:00.0: Radio type=0x2-0x1-0x0
[   14.328141] systemd-logind[2485]: Failed to start user service: 
Unknown

unit: user@117.service
[   14.356634] systemd-logind[2485]: New session c1 of user lightdm.
[   14.357320] [ cut here ]
[   14.357327] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143
netfilter_net_exit+0x25/0x50()
[   14.357328] nf_unregister_net_hook: hook not found!
[   14.357371] Modules linked in: iptable_security(+) iptable_raw
iptable_filter ip_tables x_tables input_polldev bnep binfmt_misc nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc 
uvcvideo

videobuf2_vmalloc iTCO_wdt arc4 videobuf2_memops iTCO_vendor_support
intel_rapl iosf_mbi videobuf2_v4l2 x86_pkg_temp_thermal 
intel_powerclamp
btusb coretemp snd_hda_codec_hdmi iwldvm videobuf2_core btrtl 
kvm_intel
v4l2_common mac80211 videodev btbcm snd_hda_codec_conexant btintel 
media kvm
snd_hda_codec_generic bluetooth psmouse thinkpad_acpi iwlwifi 
snd_hda_intel
pcspkr serio_raw snd_hda_codec nvram cfg80211 snd_hwdep snd_hda_core 
rfkill
i2c_i801 lpc_ich snd_pcm mfd_core snd_timer evdev snd soundcore 
shpchp
tpm_tis tpm algif_skcipher af_alg crct10dif_pclmul crc32_pclmul 
crc32c_intel

aesni_intel
[   14.357380]  ehci_pci sdhci_pci aes_x86_64 glue_helper ehci_hcd 
e1000e
lrw ablk_helper sg sdhci cryptd sd_mod ptp mmc_core usbcore 
usb_common

pps_core
[   14.357383] CPU: 2 PID: 102 Comm: kworker/u16:3 Tainted: G U
4.4.0-rc6-x220-20151224+ #1
[   14.357384] Hardware name: LENOVO 42912ZU/42912ZU, BIOS 8DET69WW 
(1.39 )

07/18/2013
[   14.357390] Workqueue: netns cleanup_net
[   14.357393]  81a27dfd 81359c69 88030e7cbd40
81060297
[   14.357395]  88030e820d80 88030e7cbd90 81c962d8
81c962e0
[   14.357397]  88030e7cbdf8 81060317 81a2c010
88030018
[   14.357398] Call Trace:
[   14.357405]  [] ? dump_stack+0x40/0x57
[   14.357408]  [] ? warn_slowpath_common+0x77/0xb0
[   14.357410]  [] ? warn_slowpath_fmt+0x47/0x50
[   14.357416]  [] ? mutex_lock+0x9/0x30
[   14.357418]  [] ? netfilter_net_exit+0x25/0x50
[   14.357421]  [] ? ops_exit_list.isra.6+0x2e/0x60
[   14.357424]  [] ? cleanup_net+0x1ab/0x280
[   14.357427]  [] ? process_one_work+0x133/0x330
[   14.357429]  [] ? worker_thread+0x60/0x470
[   14.357430]  [] ? process_one_work+0x330/0x330
[   14.357434]  [] ? kthread+0xca/0xe0
[   14.357436]  [] ? 
kthread_create_on_node+0x170/0x170

[   14.357439]  [] ? ret_from_fork+0x3f/0x70
[   14.357441]  [] ? 
kthread_create_on_node+0x170/0x170

[   14.357443] ---[ end trace 9984cc4b0e89f818 ]---
[   14.357443] [ cut here ]
[   14.357446] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143
netfilter_net_exit+0x25/0x50()
[   14.357446] nf_unregister_net_hook: hook not found!
[   14.357472] Modules linked in: iptable_security(+) iptable_raw
iptable_filter ip_tables x_tables input_polldev bnep binfmt_misc nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc 
uvcvideo

videobuf2_vmalloc iTCO_wdt arc4 videobuf2_memops iTCO_vendor_support
intel_rapl iosf_mbi videobuf2_v4l2 x86_pkg_temp_thermal 
intel_powerclamp
btusb coretemp snd_hda_codec_hdmi iwldvm videobuf2_core btrtl 
kvm_intel
v4l2_common mac80211 videodev btbcm snd_hda_codec_conexant btintel 
media kvm
snd_hda_codec_generic bluetooth psmouse thinkpad_acpi iwlwifi 
snd_hda_intel
pcspkr serio_raw snd_hda_codec nvram cfg80211 snd_hwdep snd_hda_core 
rfkill
i2c_i801 lpc_ich snd_pcm mfd_core snd_timer evdev snd soundcore 
shpchp
tpm_tis tpm algif_skcipher af_alg crct10dif_pclmul crc32_pclmul 
crc32c_intel

aesni_intel
[   14.357478]  ehci_pci sdhci_pci aes_x86_64 glue_helper ehci_hcd 
e1000e
lrw ablk_helper sg sdhci cryptd sd_mod ptp mmc_core usbcore 
usb_common

pps_core
[   14.357480] CPU: 2 PID: 102 Comm: kworker/u16:3 Tainted: G U  
W

4.4.0

nf_unregister_net_hook: hook not found!

2015-12-28 Thread Sander Eikelenboom

Hi,

Running a 4.4.0-rc6 kernel, I encountered the warning below.

--
Sander



[   13.740472] ip_tables: (C) 2000-2006 Netfilter Core Team
[   13.936237] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   13.945391] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   13.947434] iwlwifi :03:00.0: Radio type=0x2-0x1-0x0
[   14.223990] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   14.232065] iwlwifi :03:00.0: L1 Enabled - LTR Disabled
[   14.233570] iwlwifi :03:00.0: Radio type=0x2-0x1-0x0
[   14.328141] systemd-logind[2485]: Failed to start user service: 
Unknown unit: user@117.service

[   14.356634] systemd-logind[2485]: New session c1 of user lightdm.
[   14.357320] [ cut here ]
[   14.357327] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143 
netfilter_net_exit+0x25/0x50()

[   14.357328] nf_unregister_net_hook: hook not found!
[   14.357371] Modules linked in: iptable_security(+) iptable_raw 
iptable_filter ip_tables x_tables input_polldev bnep binfmt_misc nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc uvcvideo 
videobuf2_vmalloc iTCO_wdt arc4 videobuf2_memops iTCO_vendor_support 
intel_rapl iosf_mbi videobuf2_v4l2 x86_pkg_temp_thermal intel_powerclamp 
btusb coretemp snd_hda_codec_hdmi iwldvm videobuf2_core btrtl kvm_intel 
v4l2_common mac80211 videodev btbcm snd_hda_codec_conexant btintel media 
kvm snd_hda_codec_generic bluetooth psmouse thinkpad_acpi iwlwifi 
snd_hda_intel pcspkr serio_raw snd_hda_codec nvram cfg80211 snd_hwdep 
snd_hda_core rfkill i2c_i801 lpc_ich snd_pcm mfd_core snd_timer evdev 
snd soundcore shpchp tpm_tis tpm algif_skcipher af_alg crct10dif_pclmul 
crc32_pclmul crc32c_intel aesni_intel
[   14.357380]  ehci_pci sdhci_pci aes_x86_64 glue_helper ehci_hcd 
e1000e lrw ablk_helper sg sdhci cryptd sd_mod ptp mmc_core usbcore 
usb_common pps_core
[   14.357383] CPU: 2 PID: 102 Comm: kworker/u16:3 Tainted: G U  
4.4.0-rc6-x220-20151224+ #1
[   14.357384] Hardware name: LENOVO 42912ZU/42912ZU, BIOS 8DET69WW 
(1.39 ) 07/18/2013

[   14.357390] Workqueue: netns cleanup_net
[   14.357393]  81a27dfd 81359c69 88030e7cbd40 
81060297
[   14.357395]  88030e820d80 88030e7cbd90 81c962d8 
81c962e0
[   14.357397]  88030e7cbdf8 81060317 81a2c010 
88030018

[   14.357398] Call Trace:
[   14.357405]  [] ? dump_stack+0x40/0x57
[   14.357408]  [] ? warn_slowpath_common+0x77/0xb0
[   14.357410]  [] ? warn_slowpath_fmt+0x47/0x50
[   14.357416]  [] ? mutex_lock+0x9/0x30
[   14.357418]  [] ? netfilter_net_exit+0x25/0x50
[   14.357421]  [] ? ops_exit_list.isra.6+0x2e/0x60
[   14.357424]  [] ? cleanup_net+0x1ab/0x280
[   14.357427]  [] ? process_one_work+0x133/0x330
[   14.357429]  [] ? worker_thread+0x60/0x470
[   14.357430]  [] ? process_one_work+0x330/0x330
[   14.357434]  [] ? kthread+0xca/0xe0
[   14.357436]  [] ? 
kthread_create_on_node+0x170/0x170

[   14.357439]  [] ? ret_from_fork+0x3f/0x70
[   14.357441]  [] ? 
kthread_create_on_node+0x170/0x170

[   14.357443] ---[ end trace 9984cc4b0e89f818 ]---
[   14.357443] ------------[ cut here ]------------
[   14.357446] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143 netfilter_net_exit+0x25/0x50()
[   14.357446] nf_unregister_net_hook: hook not found!
[   14.357472] Modules linked in: iptable_security(+) iptable_raw 
iptable_filter ip_tables x_tables input_polldev bnep binfmt_misc nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc uvcvideo 
videobuf2_vmalloc iTCO_wdt arc4 videobuf2_memops iTCO_vendor_support 
intel_rapl iosf_mbi videobuf2_v4l2 x86_pkg_temp_thermal intel_powerclamp 
btusb coretemp snd_hda_codec_hdmi iwldvm videobuf2_core btrtl kvm_intel 
v4l2_common mac80211 videodev btbcm snd_hda_codec_conexant btintel media 
kvm snd_hda_codec_generic bluetooth psmouse thinkpad_acpi iwlwifi 
snd_hda_intel pcspkr serio_raw snd_hda_codec nvram cfg80211 snd_hwdep 
snd_hda_core rfkill i2c_i801 lpc_ich snd_pcm mfd_core snd_timer evdev 
snd soundcore shpchp tpm_tis tpm algif_skcipher af_alg crct10dif_pclmul 
crc32_pclmul crc32c_intel aesni_intel
[   14.357478]  ehci_pci sdhci_pci aes_x86_64 glue_helper ehci_hcd 
e1000e lrw ablk_helper sg sdhci cryptd sd_mod ptp mmc_core usbcore 
usb_common pps_core
[   14.357480] CPU: 2 PID: 102 Comm: kworker/u16:3 Tainted: G U  W   
4.4.0-rc6-x220-20151224+ #1
[   14.357481] Hardware name: LENOVO 42912ZU/42912ZU, BIOS 8DET69WW 
(1.39 ) 07/18/2013

[   14.357484] Workqueue: netns cleanup_net
[   14.357486]  81a27dfd 81359c69 88030e7cbd40 
81060297
[   14.357488]  88030e820db8 88030e7cbd90 81c962d8 
81c962e0
[   14.357489]  88030e7cbdf8 81060317 81a2c010 
88030018

[   14.357490] Call Trace:
[   14.357493]  [] ? dump_stack+0x40/0x57
[   14.357495]  [] ? warn_slowpath_common+0x77/0xb0
[   14.357497]  [] ? warn_slowpath_fmt+0x47/0x50
[   14.357499]  [] ? 
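
Since the same splat fires more than once in the log above, it can help to tally the WARNING sites before reading the traces. This is only a minimal sketch, assuming the dmesg output has been saved as text; `count_warning_sites` and the `sample` string are hypothetical helpers for illustration, not part of any kernel tool.

```python
import re
from collections import Counter

# Matches the header line of a kernel WARNING splat, e.g.
# "[   14.357327] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143"
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+)")


def count_warning_sites(dmesg_text):
    """Return a Counter mapping (source file, line) to the number of splats."""
    sites = Counter()
    for match in WARNING_RE.finditer(dmesg_text):
        sites[(match.group(1), int(match.group(2)))] += 1
    return sites


# Hypothetical excerpt, shortened from the log in this message.
sample = """\
[   14.357327] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143
[   14.357328] nf_unregister_net_hook: hook not found!
[   14.357446] WARNING: CPU: 2 PID: 102 at net/netfilter/core.c:143
[   14.357446] nf_unregister_net_hook: hook not found!
"""

for (path, line), hits in count_warning_sites(sample).items():
    print(f"{path}:{line} fired {hits} time(s)")
```

Grouping by file and line keeps repeated splats from the same call site from looking like separate problems.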

Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu

2015-12-14 Thread Sander Eikelenboom

On 2015-12-14 20:48, Eric Shelton wrote:

Please note that the same issue appears to have been introduced in the
recent 4.2.7 kernel.  It perhaps has to do
with b4ff8389ed14b849354b59ce9b360bdefcdbf99c having a matching
commit e8d097151d309eb71f750bbf34e6a7ef6256da7e in linux-stable.git.  The
below patch to arch/x86/kernel/rtc.c was also effective for 4.2.7.

Eric


Hi Eric,

Yeah, it's unfortunate that the patch fixing up the other patches destined for 
stable didn't make it in time for stable :(.
Anyhow, the chosen solution wasn't ideal, so there is now a V2 patch by 
Boris. It hasn't been picked up yet,
but hopefully will be soon (for the patch see 
http://lkml.iu.edu/hypermail/linux/kernel/1512.1/03504.html).


--
Sander


On 2015-12-02 18:30, Sander Eikelenboom wrote:

On 2015-12-02 15:55, David Vrabel wrote:
> On 28/11/15 15:47, Sander Eikelenboom wrote:
>> genirq: Flags mismatch irq 8.  (hvc_console) vs. 
>> (rtc0)
>
> We shouldn't register an rtc_cmos device because its legacy irq
> conflicts with the irq needed for hvc0.  For a multi VCPU guest irq 8
> is in use for the pv spinlocks and this gets requested first, preventing
> the rtc device from probing.
>
> Does this patch fix it for you?
>
> David

It does, thanks.

Reported-and-tested-by: Sander Eikelenboom <li...@eikelenboom.it>

--
Sander

> 8<
> x86: rtc_cmos platform device requires legacy irqs
>
> Adding the rtc platform device when there are no legacy irqs (no
> legacy PIC) causes a conflict with other devices that end up using the
> same irq number.
>
> In a single VCPU PV guest we should have:
>
> /proc/interrupts:
>CPU0
>   0:   4934  xen-percpu-virq  timer0
>   1:  0  xen-percpu-ipi   spinlock0
>   2:  0  xen-percpu-ipi   resched0
>   3:  0  xen-percpu-ipi   callfunc0
>   4:  0  xen-percpu-virq  debug0
>   5:  0  xen-percpu-ipi   callfuncsingle0
>   6:  0  xen-percpu-ipi   irqwork0
>   7:321   xen-dyn-event xenbus
>   8: 90   xen-dyn-event hvc_console
>   ...
>
> But hvc_console cannot get its interrupt because it is already in use
> by rtc0 and the console does not work.
>
>   genirq: Flags mismatch irq 8.  (hvc_console) vs. 
> (rtc0)
>
> The rtc_cmos device requires a particular legacy irq so don't add it
> if there are no legacy irqs.
>
> Signed-off-by: David Vrabel <david.vra...@citrix.com>
> ---
>  arch/x86/kernel/rtc.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c
> index cd96852..07c70f1 100644
> --- a/arch/x86/kernel/rtc.c
> +++ b/arch/x86/kernel/rtc.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #ifdef CONFIG_X86_32
>  /*
> @@ -200,6 +201,10 @@ static __init int add_rtc_cmos(void)
>   }
>  #endif
>
> + /* RTC uses legacy IRQs. */
> + if (!nr_legacy_irqs())
> + return -ENODEV;
> +
>   platform_device_register(&rtc_device);
>   dev_info(&rtc_device.dev,
>            "registered platform RTC device (no PNP device found)\n");

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
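
To check which action ended up owning a given irq, as in the /proc/interrupts listing quoted in the patch description, something like the following can be used. It is a rough sketch that assumes Xen-style rows where the chip name is a single token; `parse_interrupts` and the `sample` text are hypothetical and only mirror the layout shown above.

```python
def parse_interrupts(text):
    """Map each numeric IRQ to (total count, chip, action) from /proc/interrupts text."""
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip the CPU header and named rows such as NMI or LOC
        fields = rest.split()
        counts = []
        while fields and fields[0].isdigit():
            counts.append(int(fields.pop(0)))  # one counter per CPU column
        chip = fields[0] if fields else ""
        action = " ".join(fields[1:])
        table[int(label)] = (sum(counts), chip, action)
    return table


# Hypothetical excerpt in the layout of the listing from the patch description.
sample = """\
           CPU0
  0:   4934  xen-percpu-virq  timer0
  7:    321  xen-dyn-event    xenbus
  8:     90  xen-dyn-event    hvc_console
NMI:      0  Non-maskable interrupts
"""

irqs = parse_interrupts(sample)
print("irq 8 is owned by:", irqs[8][2])
```

On a guest showing the problem, irq 8 would be expected to list rtc0 instead of hvc_console.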


Re: [Xen-devel] [PATCH] x86: Xen PV guests don't have the rtc_cmos platform device

2015-12-09 Thread Sander Eikelenboom

On 2015-12-09 15:42, Jan Beulich wrote:

On 09.12.15 at 15:32,  wrote:

--- a/arch/x86/kernel/rtc.c
+++ b/arch/x86/kernel/rtc.c
@@ -200,6 +200,9 @@ static __init int add_rtc_cmos(void)
}
 #endif

+   if (paravirt_enabled())
+   return -ENODEV;


What about Xen Dom0?

Jan


Checked that in my testing and that still worked:
[   16.733837] rtc_cmos 00:02: RTC can wake from S4
[   16.734030] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
[   16.734087] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram
[   17.760329] rtc_cmos 00:02: setting system clock to 2015-12-09 08:43:48 UTC (1449650628)


and /dev/rtc and /dev/rtc0 both exist.

But I don't know the nitty-gritty details of why ...
--
Sander
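
As a quick sanity check on the dom0 log above, the wall-clock time and the epoch value printed by rtc_cmos can be cross-checked. This is a small sketch under the assumption that the line has the "setting system clock to ... UTC (epoch)" form shown above; `clock_line_is_consistent` is a hypothetical helper, not a kernel or Xen interface.

```python
import re
from datetime import datetime, timezone

# Captures the human-readable UTC time and the epoch seconds in parentheses.
CLOCK_RE = re.compile(
    r"setting system clock to (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \((\d+)\)"
)


def clock_line_is_consistent(line):
    """Return True if the printed UTC time equals the printed epoch seconds."""
    match = CLOCK_RE.search(line)
    if match is None:
        raise ValueError("not an rtc_cmos 'setting system clock' line")
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    return int(stamp.timestamp()) == int(match.group(2))


# The log line from this message, with the mail wrap removed.
line = ("[   17.760329] rtc_cmos 00:02: setting system clock to "
        "2015-12-09 08:43:48 UTC (1449650628)")
print(clock_line_is_consistent(line))
```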


Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-02 Thread Sander Eikelenboom

On 2015-12-02 15:55, David Vrabel wrote:

On 28/11/15 15:47, Sander Eikelenboom wrote:
genirq: Flags mismatch irq 8.  (hvc_console) vs.  
(rtc0)


We shouldn't register an rtc_cmos device because its legacy irq
conflicts with the irq needed for hvc0.  For a multi VCPU guest irq 8
is in use for the pv spinlocks and this gets requested first, preventing
the rtc device from probing.

Does this patch fix it for you?

David


It does, thanks.

Reported-and-tested-by: Sander Eikelenboom <li...@eikelenboom.it>

--
Sander


8<
x86: rtc_cmos platform device requires legacy irqs

Adding the rtc platform device when there are no legacy irqs (no
legacy PIC) causes a conflict with other devices that end up using the
same irq number.

In a single VCPU PV guest we should have:

/proc/interrupts:
   CPU0
  0:   4934  xen-percpu-virq  timer0
  1:  0  xen-percpu-ipi   spinlock0
  2:  0  xen-percpu-ipi   resched0
  3:  0  xen-percpu-ipi   callfunc0
  4:  0  xen-percpu-virq  debug0
  5:  0  xen-percpu-ipi   callfuncsingle0
  6:  0  xen-percpu-ipi   irqwork0
  7:321   xen-dyn-event xenbus
  8: 90   xen-dyn-event hvc_console
  ...

But hvc_console cannot get its interrupt because it is already in use
by rtc0 and the console does not work.

  genirq: Flags mismatch irq 8.  (hvc_console) vs.  
(rtc0)


The rtc_cmos device requires a particular legacy irq so don't add it
if there are no legacy irqs.

Signed-off-by: David Vrabel <david.vra...@citrix.com>
---
 arch/x86/kernel/rtc.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c
index cd96852..07c70f1 100644
--- a/arch/x86/kernel/rtc.c
+++ b/arch/x86/kernel/rtc.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 

 #ifdef CONFIG_X86_32
 /*
@@ -200,6 +201,10 @@ static __init int add_rtc_cmos(void)
 	}
 #endif
 
+	/* RTC uses legacy IRQs. */
+	if (!nr_legacy_irqs())
+		return -ENODEV;
+
 	platform_device_register(&rtc_device);
 	dev_info(&rtc_device.dev,
 		 "registered platform RTC device (no PNP device found)\n");



Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-02 Thread Sander Eikelenboom

On 2015-12-02 00:41, Boris Ostrovsky wrote:

On 12/01/2015 06:30 PM, Sander Eikelenboom wrote:

On 2015-12-02 00:19, Boris Ostrovsky wrote:

On 12/01/2015 06:00 PM, Sander Eikelenboom wrote:

On 2015-12-01 23:47, Boris Ostrovsky wrote:

On 11/30/2015 05:55 PM, Sander Eikelenboom wrote:

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:
On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom 
wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the 
tip tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus 
goes well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it 
seems a kworker

thread is stuck:
root   569 98.0  0.0  0 0 ?R 16:02 12:47
[kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu; bisecting would probably 
be quite painful since there were some breakages this merge window with 
respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the 
latest set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second 
fixes

a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.


One of these patches also fixes a bug i was having with a 
pci-passthrough device in
a HVM that wasn't working (depending on which dom0-kernel i was 
using (4.3 or 4.4)),

but didn't report yet.

Fingers crossed but i think this pv-guest single vcpu issue is the 
last i'm troubled by for now ;)


I could not reproduce this, including with your kernel config file.


Hmm that's unpleasant :-\

Hmm, another strange thing is that it doesn't seem to affect dom0 (which is 
also a PV guest), only unprivileged ones.
All unprivileged pv-guests seem to have the irq issue, but only with 
a single vcpu do I get the stuck kworker thread that got my 
attention; with 2 vcpus that doesn't seem to happen, but you still 
get the dmesg output and warnings about hvc.


Could it be that:

arch/x86/include/asm/i8259.h
static inline int nr_legacy_irqs(void)
{
return legacy_pic->nr_legacy_irqs;
}

returns something different in some circumstances ?


It should return 16 pre-8c058b0b9c34d8c8d7912880956543769323e2d8 and 
0

after that commit.

This is the last number that you see in
NR_IRQS:4352 nr_irqs:48 0
line.
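
To pull that last number out of a saved boot log, a short script along these lines may be enough. It is a sketch only: `parse_nr_irqs` is a hypothetical helper, and reading the third field as the legacy irq count simply follows the explanation above rather than any documented format.

```python
import re

# Matches the early-boot line, e.g. "NR_IRQS:4352 nr_irqs:48 0"
NR_IRQS_RE = re.compile(r"NR_IRQS:\s*(\d+)\s+nr_irqs:\s*(\d+)\s+(\d+)")


def parse_nr_irqs(line):
    """Split the NR_IRQS boot line into its three numeric fields."""
    match = NR_IRQS_RE.search(line)
    if match is None:
        raise ValueError("no NR_IRQS line found")
    compile_time_max, runtime_count, legacy = (int(g) for g in match.groups())
    return {
        "NR_IRQS": compile_time_max,
        "nr_irqs": runtime_count,
        # Assumption taken from the thread: the trailing value is the
        # number of legacy irqs (16 before the commit, 0 after it).
        "nr_legacy_irqs": legacy,
    }


info = parse_nr_irqs("NR_IRQS:4352 nr_irqs:48 0")
print(info["nr_legacy_irqs"])
```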

I think you should be able to safely revert both
b4ff8389ed14b849354b59ce9b360bdefcdbf99c and
8c058b0b9c34d8c8d7912880956543769323e2d8 and see if it makes any
difference.


-boris



That was already underway compiling :)

And it does reveal that reverting both fixes the issue, no stuck 
kworker thread .. and no:
   genirq: Flags mismatch irq 8.  (hvc_console) vs.  
(rtc0)

   hvc_open: request_irq failed with rc -16.
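
For reference, the rc -16 in the hvc_open message is a negated errno value. A minimal sketch for decoding such codes, assuming it runs on a Linux host where Python's errno numbering matches the kernel's; `decode_kernel_rc` is a hypothetical helper name:

```python
import errno
import os


def decode_kernel_rc(rc):
    """Translate a kernel-style negative return code into its errno name."""
    if rc >= 0:
        return "OK"
    return errno.errorcode.get(-rc, f"unknown errno {-rc}")


# -16 is the value reported by hvc_open in the dmesg output above.
print(decode_kernel_rc(-16), "-", os.strerror(16))
```

EBUSY fits the picture of the irq already being claimed by another device when hvc_console asks for it.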



Let me try it again tomorrow. Can you post your guest config file, Xen
version and host HW (Intel or AMD)? 'xl info' maybe?

-boris


Hi Boris,

A fresh new day .. a fresh new thought.
If I look at the /proc/interrupts from a broken kernel and from a kernel with 
both commits reverted, the thing that catches the eye is irq8, just as the 
dmesg message was telling.

In my PV guest rtc0 now seems to try to take irq8, which was already 
assigned to HVC?
Sounds like some assumptions around the legacy range are broken 
somewhere.

What is the benefit of not just reserving the legacy range?

Attached the /proc/interrupts from both boots.

--
Sander






What I did get was a conflict when reverting 
b4ff8389ed14b849354b59ce9b360bdefcdbf99c:
arch/arm64/include/asm/irq.h, although that shouldn't matter because 
we are on x86 and not on arm.


-- Sander




-- Sander



-boris


___
Xen-devel mailing list
xen-de...@lists.xen.org
http://lists.xen.org/xen-devel

           CPU0
 16: 315536  xen-percpu-virq  timer0
 17:  0  xen-percpu-ipi   spinlock0
 18:  0  xen-percpu-ipi   resched0
 19:  0  xen-percpu-ipi   callfunc0
 20:  0  xen-percpu-virq  debug0
 21:  0  xen-percpu-ipi   callfuncsingle0
 22:  0  xen-percpu-ipi   irqwork0
 23:346   xen-dyn-event xenbus
 24:134   xen-dyn-event hvc_console
 25:  11464   xen-dyn-event blkif
 26:  28710   xen-dyn-event eth0-q0-tx
 27:  40136   xen-dyn-event eth0-q0-rx
NMI:  0   Non-maskable interrupts
LOC:  0   Local timer interrupts
SPU:  0   Spurious interrupts
PMI:  0   Performance monitoring interrupts
IWI:  0   IRQ work interrupts
RTR:  0   APIC ICR read retries
RES:  0   Resche

Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-01 Thread Sander Eikelenboom

On 2015-12-02 00:41, Boris Ostrovsky wrote:

On 12/01/2015 06:30 PM, Sander Eikelenboom wrote:

On 2015-12-02 00:19, Boris Ostrovsky wrote:

On 12/01/2015 06:00 PM, Sander Eikelenboom wrote:

On 2015-12-01 23:47, Boris Ostrovsky wrote:

On 11/30/2015 05:55 PM, Sander Eikelenboom wrote:

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:
On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom 
wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the 
tip tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus 
goes well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it 
seems a kworker

thread is stuck:
root   569 98.0  0.0      0     0 ?        R    16:02  12:47 [kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu. Bisecting would 
probably be quite painful, since there were some breakages this merge 
window with respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the 
latest set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second 
fixes

a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.


One of these patches also fixes a bug i was having with a 
pci-passthrough device in
a HVM that wasn't working (depending on which dom0-kernel i was 
using (4.3 or 4.4)),

but didn't report yet.

Fingers crossed but i think this pv-guest single vcpu issue is the 
last i'm troubled by for now ;)


I could not reproduce this, including with your kernel config file.


Hmm that's unpleasant :-\

Hmm, the other strange thing is that it doesn't seem to affect dom0 (which 
is also a PV guest), only unprivileged guests.
All unprivileged pv-guests seem to have the irq issue, but only with a 
single vcpu do I get the stuck kworker thread that got my attention. With 
2 vcpus that doesn't seem to happen, but you still get the dmesg output 
and the warnings about hvc.


Could it be that:

arch/x86/include/asm/i8259.h
static inline int nr_legacy_irqs(void)
{
	return legacy_pic->nr_legacy_irqs;
}

returns something different in some circumstances ?


It should return 16 before 8c058b0b9c34d8c8d7912880956543769323e2d8 and 
0 after that commit.

This is the last number that you see in
NR_IRQS:4352 nr_irqs:48 0
line.
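
For anyone comparing boot logs by hand, here is a minimal user-space sketch that pulls the three numbers out of that line. It is not kernel code; the struct and function names are made up for illustration, and the only thing taken from the thread is that the trailing number is what nr_legacy_irqs() returned. A leading '+', '-' or space (as in the dmesg diffs posted in this thread) is skipped.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the "NR_IRQS:<a> nr_irqs:<b> <c>" boot line. */
struct irq_boot_line {
	int nr_irqs_max;	/* compile-time NR_IRQS */
	int nr_irqs;		/* runtime nr_irqs */
	int nr_legacy;		/* trailing number: nr_legacy_irqs() at boot */
};

/* Returns 0 on success, -1 if the line is not an NR_IRQS line. */
static int parse_irq_boot_line(const char *line, struct irq_boot_line *out)
{
	assert(line != NULL && out != NULL);

	/* Tolerate a single diff marker in front of the line. */
	if (*line == '+' || *line == '-' || *line == ' ')
		line++;

	if (sscanf(line, "NR_IRQS:%d nr_irqs:%d %d",
		   &out->nr_irqs_max, &out->nr_irqs, &out->nr_legacy) != 3)
		return -1;
	return 0;
}
```

Feeding it the 4.3 line should give nr_legacy == 16 and the 4.4 lines nr_legacy == 0, which is the difference being discussed here.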

I think you should be able to safely revert both
b4ff8389ed14b849354b59ce9b360bdefcdbf99c and
8c058b0b9c34d8c8d7912880956543769323e2d8 and see if it makes any
difference.


-boris



That was already underway compiling :)

And it does reveal that reverting both fixes the issue: no stuck 
kworker thread .. and no:
   genirq: Flags mismatch irq 8.  (hvc_console) vs.  (rtc0)

   hvc_open: request_irq failed with rc -16.



Let me try it again tomorrow. Can you post your guest config file, Xen
version and host HW (Intel or AMD)? 'xl info' maybe?

-boris


Guest config file == dom0 config file == the one I sent you earlier.
Host is an AMD Phenom X6.

# xl info
host                   : serveerstertje
release                : 4.4.0-rc3-20151201-linus-doflr-boris+
version                : #1 SMP Tue Dec 1 19:02:58 CET 2015
machine                : x86_64
nr_cpus                : 6
max_cpu_id             : 5
nr_nodes               : 1
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 3200
hw_caps                : 178bf3ff:efd3fbff::00011300:00802001::37ff:
virt_caps              : hvm hvm_directio
total_memory           : 20479
free_memory            : 7745
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 7
xen_extra              : -unstable
xen_version            : 4.7-unstable
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0x8000
xen_changeset          : Thu Nov 26 20:58:13 2015 +0100 git:5252636-dirty
xen_commandline        : dom0_mem=1536M,max:1536M loglvl=all loglvl_guest=all console_timestamps=datems vga=gfx-1280x1024x32 cpuidle cpufreq=xen com1=38400,8n1 console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug,amd-iommu-debug conring_size=128k ucode=-1
cc_compiler            : gcc-4.9.real (Debian 4.9.2-10) 4.9.2
cc_compile_by          : root
cc_compile_domain      : dyndns.org
cc_compile_date        : Thu Nov 26 21:18:41 CET 2015
xend_config_format     : 4

If you need and can get mor

Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-01 Thread Sander Eikelenboom

On 2015-12-02 00:19, Boris Ostrovsky wrote:

On 12/01/2015 06:00 PM, Sander Eikelenboom wrote:

On 2015-12-01 23:47, Boris Ostrovsky wrote:

On 11/30/2015 05:55 PM, Sander Eikelenboom wrote:

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:
On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom 
wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the 
tip tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus 
goes well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it 
seems a kworker

thread is stuck:
root   569 98.0  0.0      0     0 ?        R    16:02  12:47 [kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu. Bisecting would 
probably be quite painful, since there were some breakages this merge 
window with respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the 
latest set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second 
fixes

a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.


One of these patches also fixes a bug i was having with a 
pci-passthrough device in
a HVM that wasn't working (depending on which dom0-kernel i was 
using (4.3 or 4.4)),

but didn't report yet.

Fingers crossed but i think this pv-guest single vcpu issue is the 
last i'm troubled by for now ;)


I could not reproduce this, including with your kernel config file.


Hmm that's unpleasant :-\

Hmm, the other strange thing is that it doesn't seem to affect dom0 (which 
is also a PV guest), only unprivileged guests.
All unprivileged pv-guests seem to have the irq issue, but only with a 
single vcpu do I get the stuck kworker thread that got my attention. With 
2 vcpus that doesn't seem to happen, but you still get the dmesg output 
and the warnings about hvc.


Could it be that:

arch/x86/include/asm/i8259.h
static inline int nr_legacy_irqs(void)
{
	return legacy_pic->nr_legacy_irqs;
}

returns something different in some circumstances ?


It should return 16 pre-8c058b0b9c34d8c8d7912880956543769323e2d8 and 0
after that commit.

This is the last number that you see in
NR_IRQS:4352 nr_irqs:48 0
line.

I think you should be able to safely revert both
b4ff8389ed14b849354b59ce9b360bdefcdbf99c and
8c058b0b9c34d8c8d7912880956543769323e2d8 and see if it makes any
difference.


-boris



That was already underway compiling :)

And it does reveal that reverting both fixes the issue, no stuck kworker 
thread .. and no:
   genirq: Flags mismatch irq 8.  (hvc_console) vs.  
(rtc0)

   hvc_open: request_irq failed with rc -16.

What I did get was a conflict when reverting 
b4ff8389ed14b849354b59ce9b360bdefcdbf99c:
arch/arm64/include/asm/irq.h, although that shouldn't matter because we 
are on x86 and not on ARM.


--
Sander




-- Sander



-boris


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-01 Thread Sander Eikelenboom

On 2015-12-01 23:47, Boris Ostrovsky wrote:

On 11/30/2015 05:55 PM, Sander Eikelenboom wrote:

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:

On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the tip 
tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus 
goes well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it seems 
a kworker

thread is stuck:
root   569 98.0  0.0      0     0 ?        R    16:02  12:47 [kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu. Bisecting would 
probably be quite painful, since there were some breakages this merge 
window with respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the latest 
set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second 
fixes

a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.


One of these patches also fixes a bug i was having with a 
pci-passthrough device in
a HVM that wasn't working (depending on which dom0-kernel i was using 
(4.3 or 4.4)),

but didn't report yet.

Fingers crossed but i think this pv-guest single vcpu issue is the 
last i'm troubled by for now ;)


I could not reproduce this, including with your kernel config file.


Hmm that's unpleasant :-\

Hmm, the other strange thing is that it doesn't seem to affect dom0 (which 
is also a PV guest), only unprivileged guests.
All unprivileged pv-guests seem to have the irq issue, but only with a 
single vcpu do I get the stuck kworker thread that got my attention. With 
2 vcpus that doesn't seem to happen, but you still get the dmesg output 
and the warnings about hvc.


Could it be that:

arch/x86/include/asm/i8259.h
static inline int nr_legacy_irqs(void)
{
	return legacy_pic->nr_legacy_irqs;
}

returns something different in some circumstances ?

--
Sander



-boris


Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-12-01 Thread Sander Eikelenboom

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:

On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the tip 
tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus goes 
well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it seems a 
kworker

thread is stuck:
root   569 98.0  0.0      0     0 ?        R    16:02  12:47 [kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu. Bisecting would 
probably be quite painful, since there were some breakages this merge 
window with respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the latest 
set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second fixes
a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.




Thanks :)

-- Sander


Between 4.3 and 4.4-single:

-NR_IRQS:4352 nr_irqs:32 16
+Using NULL legacy PIC
+NR_IRQS:4352 nr_irqs:32 0


This is fine, as long as you have 
b4ff8389ed14b849354b59ce9b360bdefcdbf99c.




-cpu 0 spinlock event irq 17
+cpu 0 spinlock event irq 1


This is strange. I wouldn't expect spinlocks to use legacy irqs.
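
The numbers quoted in this thread fit a very simple picture, sketched below as a toy model. This is not the kernel's allocator: it just assumes dynamic IRQs are handed out sequentially, starting at the first number past the reserved legacy range, in the order the event channels show up in the guest's /proc/interrupts posted elsewhere in this thread. The function name and the RTC_IRQ macro are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RTC_IRQ 8	/* the irq number named in the "Flags mismatch" log */

/* Event channels in the order listed in the guest's /proc/interrupts. */
static const char *const xen_events[] = {
	"timer0", "spinlock0", "resched0", "callfunc0", "debug0",
	"callfuncsingle0", "irqwork0", "xenbus", "hvc_console",
};

/*
 * Toy model (assumption): the n-th dynamic event gets irq nr_legacy + n.
 * Returns -1 if the name is not in the list above.
 */
static int model_irq_for(const char *name, int nr_legacy)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(xen_events) / sizeof(xen_events[0]); i++)
		if (strcmp(xen_events[i], name) == 0)
			return nr_legacy + (int)i;
	return -1;
}
```

With 16 legacy IRQs reserved, the model puts spinlock0 on irq 17 and hvc_console on irq 24, as in the 4.3 boot. With 0 reserved it puts spinlock0 on irq 1 and hvc_console on irq 8, which is the number rtc0 shows up on in the genirq message.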



Could it be .. that with your fixup:
    xen/events: Always allocate legacy interrupts on PV guests
    (b4ff8389ed14b849354b59ce9b360bdefcdbf99c)
for commit:
    x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
    (8c058b0b9c34d8c8d7912880956543769323e2d8)

we now have the situation described in the commit message of 
8c058b0b9c, but now for Xen PV instead of Hyper-V ?
(It seems both Xen and Hyper-V want to achieve the same thing but have 
different, competing implementations ?)


(BTW 8c058b0b9c has a CC for stable ... so could be destined to cause 
more trouble).


--
Sander




and later on:

-hctosys: unable to open rtc device (rtc0)
+rtc_cmos rtc_cmos: hctosys: unable to read the hardware clock

+genirq: Flags mismatch irq 8.  (hvc_console) vs.  (rtc0)

+hvc_open: request_irq failed with rc -16.
+Warning: unable to open an initial console.
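
For reference, rc -16 is -EBUSY on Linux. Below is a simplified user-space sketch of the kind of compatibility check that produces a "Flags mismatch" rejection when a second handler asks for an irq that already has one. It is an illustration under stated assumptions, not the genirq code: the MODEL_* bit values and the function name are hypothetical, and the actual flag values from this boot are not shown in the log above, so none are filled in here.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bits for this sketch only, not values taken from the log. */
#define MODEL_IRQF_TRIGGER_MASK	0x0fu
#define MODEL_IRQF_SHARED	0x80u

/*
 * Simplified model: a new handler may join an irq that already has a
 * handler only if both sides asked for sharing and agree on trigger type.
 * Returns 0 on success, -EBUSY on a mismatch.
 */
static int model_request_irq(unsigned int old_flags, int has_old,
			     unsigned int new_flags)
{
	assert(has_old == 0 || has_old == 1);

	if (!has_old)
		return 0;		/* irq was free */
	if (!((old_flags & new_flags) & MODEL_IRQF_SHARED))
		return -EBUSY;		/* at least one side refuses to share */
	if ((old_flags ^ new_flags) & MODEL_IRQF_TRIGGER_MASK)
		return -EBUSY;		/* trigger types disagree */
	return 0;
}
```

In this model, two unrelated drivers landing on the same number (here hvc_console and rtc0 on irq 8) fail as soon as either one did not request a shared line, which would match the request_irq failure reported by hvc_open.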


between 4.4-single and 4.4-multi:

 Using NULL legacy PIC
-NR_IRQS:4352 nr_irqs:32 0
+NR_IRQS:4352 nr_irqs:48 0


This is probably OK too, since nr_irqs depends on the number of CPUs.

I think something is messed up with IRQs. Last week I saw something
from setup_irq() generating a stack dump (warning) for rtc_cmos, but
it appeared harmless at the time and now I don't see it anymore.

-boris




and later on:

-rtc_cmos rtc_cmos: hctosys: unable to read the hardware clock
+hctosys: unable to open rtc device (rtc0)

-genirq: Flags mismatch irq 8.  (hvc_console) vs.  (rtc0)

-hvc_open: request_irq failed with rc -16.
-Warning: unable to open an initial console.

attached:
- dmesg with 4.3 kernel with 1 vcpu
- dmesg with 4.4 kernel with 1 vcpu
- dmesg with 4.4 kernel with 2 vcpus
- .config of the 4.4 kernel

-- Sander





Re: [Xen-devel] linux 4.4 Regression: 100% cpu usage on idle pv guest under Xen with single vcpu.

2015-11-30 Thread Sander Eikelenboom

On 2015-11-30 23:54, Boris Ostrovsky wrote:

On 11/30/2015 04:46 PM, Sander Eikelenboom wrote:

On 2015-11-30 22:45, Konrad Rzeszutek Wilk wrote:

On Sat, Nov 28, 2015 at 04:47:43PM +0100, Sander Eikelenboom wrote:

Hi all,

I have just tested a 4.4-rc2 kernel (current linus tree) + the tip 
tree

pulled on top.

Running this kernel under Xen on PV-guests with multiple vcpus goes 
well (on

idle < 10% cpu usage),
but a guest with only a single vcpu doesn't idle at all, it seems a 
kworker

thread is stuck:
root   569 98.0  0.0      0     0 ?        R    16:02  12:47 [kworker/0:1]

Running a 4.3 kernel works fine with a single vcpu. Bisecting would 
probably be quite painful, since there were some breakages this merge 
window with respect to Xen pv-guests.

There are some differences in the diff's from booting a 4.3, 
4.4-single,

4.4-multi cpu boot:


Boris has been tracking a bunch of them. I am attaching the latest 
set of

patches I've to carry on top of v4.4-rc3.


Hi Konrad,

i will test those, see if it fixes all my issues and report back


They shouldn't help you ;-( (and I just saw a message from you 
confirming this)


The first one fixes a 32-bit bug (on bare metal too). The second fixes
a fatal bug for 32-bit PV guests. The other two are code
improvements/cleanup.


One of these patches also fixes a bug i was having with a 
pci-passthrough device in
a HVM that wasn't working (depending on which dom0-kernel i was using 
(4.3 or 4.4)),

but didn't report yet.

Fingers crossed but i think this pv-guest single vcpu issue is the last 
i'm troubled by for now ;)


--
Sander





Thanks :)

-- Sander


Between 4.3 and 4.4-single:

-NR_IRQS:4352 nr_irqs:32 16
+Using NULL legacy PIC
+NR_IRQS:4352 nr_irqs:32 0


This is fine, as long as you have 
b4ff8389ed14b849354b59ce9b360bdefcdbf99c.




-cpu 0 spinlock event irq 17
+cpu 0 spinlock event irq 1


This is strange. I wouldn't expect spinlocks to use legacy irqs.



and later on:

-hctosys: unable to open rtc device (rtc0)
+rtc_cmos rtc_cmos: hctosys: unable to read the hardware clock

+genirq: Flags mismatch irq 8.  (hvc_console) vs.  (rtc0)

+hvc_open: request_irq failed with rc -16.
+Warning: unable to open an initial console.


between 4.4-single and 4.4-multi:

 Using NULL legacy PIC
-NR_IRQS:4352 nr_irqs:32 0
+NR_IRQS:4352 nr_irqs:48 0


This is probably OK too, since nr_irqs depends on the number of CPUs.

I think something is messed up with IRQs. Last week I saw something
from setup_irq() generating a stack dump (warning) for rtc_cmos, but
it appeared harmless at the time and now I don't see it anymore.

-boris




and later on:

-rtc_cmos rtc_cmos: hctosys: unable to read the hardware clock
+hctosys: unable to open rtc device (rtc0)

-genirq: Flags mismatch irq 8.  (hvc_console) vs.  (rtc0)

-hvc_open: request_irq failed with rc -16.
-Warning: unable to open an initial console.

attached:
- dmesg with 4.3 kernel with 1 vcpu
- dmesg with 4.4 kernel with 1 vcpu
- dmesg with 4.4 kernel with 2 vcpus
- .config of the 4.4 kernel

-- Sander





Re: [linux-4.4-mw] Regression: cx25821: Oops: no 32bit PCI DMA

2015-11-15 Thread Sander Eikelenboom

On 2015-11-15 13:56, Christoph Hellwig wrote:

Hi Sander,

this is my fault.  Please see the patch which I already sent out
to Andrew and lkml.


Hi Christoph,

Thanks for the pointer, just tested and it works fine again.

--
Sander


Re: [Xen-devel] Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-05 Thread Sander Eikelenboom

Thursday, November 5, 2015, 2:53:40 PM, you wrote:

> On 11/05/2015 04:13 AM, Sander Eikelenboom wrote:
>>
>> It makes "cat /sys/kernel/debug/kernel_page_tables" work and
>> prevents a kernel with CONFIG_DEBUG_WX=y from crashing at boot.

> Great. Our nightly runs also failed spectacularly due to this bug.

>>
>> It now does give a warning about an insecure W+X mapping, so
>> CONFIG_DEBUG_WX=y seems to be working. No idea how to interpret it,
>> though, or whether it's a legitimate warning.
>>
>> -- 
>> Sander
>>
>> [   19.034706] Freeing unused kernel memory: 1104K (822fc000 - 8241)
>> [   19.041339] Write protecting the kernel read-only data: 18432k
>> [   19.052596] Freeing unused kernel memory: 1144K (880001ae2000 - 880001c0)
>> [   19.060285] Freeing unused kernel memory: 1560K (88000207a000 - 88000220)
>> [   19.067079] ------------[ cut here ]------------
>> [   19.073931] WARNING: CPU: 5 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x619/0x7e0()

> Yes, this apparently is a known issue: https://lkml.org/lkml/2015/11/4/476

> -boris

Ah thx for the pointer :)

--
Sander






Re: [Xen-devel] Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-05 Thread Sander Eikelenboom

On 2015-11-05 00:13, Boris Ostrovsky wrote:

On 11/04/2015 03:02 PM, Sander Eikelenboom wrote:

On 2015-11-04 19:47, Stephen Smalley wrote:

On 11/04/2015 01:28 PM, Sander Eikelenboom wrote:

On 2015-11-04 16:52, Stephen Smalley wrote:

On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:

Hi All,

I just tried to boot with the current linus mergewindow tree under 
Xen.
It fails with a kernel panic at boot with the new 
"CONFIG_DEBUG_WX"

option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K 
(822fc000 -

8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K 
(880001ae2000 -

880001c0)
[   18.447966] Freeing unused kernel memory: 1560K 
(88000207a000 -

88000220)
[   18.453947] BUG: unable to handle kernel paging request at
88055c883000
[   18.459943] IP: []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , 
BIOS

V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 
task.ti:

880059b98000
[   18.500852] RIP: e030:[] []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX:
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI:
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:

[   18.532522] R10: 88005fd0e790 R11: 0001 R12:
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:

[   18.545247] FS:  () 
GS:88005f68()

knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4:
0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000
0800
[   18.577704]  8000 88055c883000 0007
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001

[   18.590916] Call Trace:
[   18.597458]  [] ? 
free_reserved_area+0x11e/0x120

[   18.604180]  []
ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe 
ff ff
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 
70 ff
ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 
4c 89

[   18.658246] RIP  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



What's your .config?  Does cat /sys/kernel/debug/kernel_page_tables
produce a similar fault even with CONFIG_DEBUG_WX=n?


.config is attached

Hmm that sysfs file doesn't seem to exist then:
# cat /sys/kernel/debug/kernel_page_tables
cat: /sys/kernel/debug/kernel_page_tables: No such file or directory


Needs CONFIG_X86_PTDUMP=y.
Also assumes you have debugfs mounted there.


Recompiled, and the result is that it also blows up:



Can you try this:


diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 1bf417e..b534216 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -362,8 +362,13 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
+/* ffff800000000000 - ffff87ffffffffff is reserved for hypervisor */
+#define is_hypervisor_range(idx)  (paravirt_enabled() && \
+		((idx >= pgd_index(__PAGE_OFFSET) - 16) && \
+		 (idx < pgd_index(__PAGE_OFFSET))))
 	pgd_t *start = (pgd_t *) &init_level4_pgt;
 #else
+#define is_hypervisor_range(idx)   0
 	pgd_t *start = swapper_pg_dir;
 #endif
 	pgprotval_t prot;
@@ -381,7 +386,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 
 	for (i = 0; i < PTRS_PER_PGD; i++) {
 		st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
-		if (!pgd_none(*start)) {
+		if (!pgd_none(*start) && !is_hypervisor_range(i)) {
 			if (pgd_large(*start) || !pgd_present(*start)) {
 				prot = pgd_flags(*start);
 				note_page(m, &st, __pgprot(prot), 1);


Hi 

Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 19:47, Stephen Smalley wrote:

On 11/04/2015 01:28 PM, Sander Eikelenboom wrote:

On 2015-11-04 16:52, Stephen Smalley wrote:

On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:

Hi All,

I just tried to boot with the current linus mergewindow tree under 
Xen.

It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 
-

8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 
-

880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 
-

88000220)
[   18.453947] BUG: unable to handle kernel paging request at
88055c883000
[   18.459943] IP: []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , 
BIOS

V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
880059b98000
[   18.500852] RIP: e030:[]  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX:
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI:
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:

[   18.532522] R10: 88005fd0e790 R11: 0001 R12:
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:

[   18.545247] FS:  () GS:88005f68()
knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4:
0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000
0800
[   18.577704]  8000 88055c883000 0007
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001

[   18.590916] Call Trace:
[   18.597458]  [] ? 
free_reserved_area+0x11e/0x120

[   18.604180]  []
ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff 
ff
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 
ff
ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 
89

[   18.658246] RIP  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



What's your .config?  Does cat /sys/kernel/debug/kernel_page_tables
produce a similar fault even with CONFIG_DEBUG_WX=n?


.config is attached

Hmm that sysfs file doesn't seem to exist then:
# cat /sys/kernel/debug/kernel_page_tables
cat: /sys/kernel/debug/kernel_page_tables: No such file or directory


Needs CONFIG_X86_PTDUMP=y.
Also assumes you have debugfs mounted there.


Recompiled, and the result is that it also blows up:
[  902.389247] BUG: unable to handle kernel paging request at 
88055c883000
[  902.402749] IP: [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[  902.416261] PGD 2212067 PUD 0
[  902.427768] Oops:  [#1] SMP
[  902.438137] Modules linked in:
[  902.448299] CPU: 2 PID: 21951 Comm: cat Not tainted 
4.3.0-mw-20151104-linus-doflr-nodebugwx-withptdump+ #1
[  902.458581] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
V1.8B1 09/13/2010
[  902.468850] task: 88004b49e300 ti: 88005928c000 task.ti: 
88005928c000
[  902.479133] RIP: e030:[]  [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[  902.489536] RSP: e02b:88005928fd20  EFLAGS: 00010296
[  902.499692] RAX: 88055c883000 RBX:  RCX: 
8800
[  902.509755] RDX: 0067 RSI: 88005928fd70 RDI: 
88001000
[  902.519680] RBP: 88005928fdd8 R08: 1000 R09: 

[  902.529555] R10:  R11: 0246 R12: 
88005928ff20
[  902.539349] R13: cfff R14: 88005928fd70 R15: 
880033c773c0
[  902.549081] FS:  7f56b07d4700() GS:88005f68() 
knlGS:

[  902.558690] CS:  e033 DS:  ES:  CR0: 8005003b
[  902.568111] CR2: 88055c883000 CR3: 4563f000 CR4: 
0660

[  902.577508] Stac

Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 16:52, Stephen Smalley wrote:

On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:

Hi All,

I just tried to boot with the current linus mergewindow tree under 
Xen.

It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 -
8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 -
880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 -
88000220)
[   18.453947] BUG: unable to handle kernel paging request at
88055c883000
[   18.459943] IP: []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , 
BIOS

V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
880059b98000
[   18.500852] RIP: e030:[]  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX:
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI:
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:

[   18.532522] R10: 88005fd0e790 R11: 0001 R12:
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:

[   18.545247] FS:  () GS:88005f68()
knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4:
0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000
0800
[   18.577704]  8000 88055c883000 0007
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001

[   18.590916] Call Trace:
[   18.597458]  [] ? free_reserved_area+0x11e/0x120
[   18.604180]  []
ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff 
ff
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 
ff

ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 89
[   18.658246] RIP  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



What's your .config?  Does cat /sys/kernel/debug/kernel_page_tables
produce a similar fault even with CONFIG_DEBUG_WX=n?


.config is attached

Hmm that sysfs file doesn't seem to exist then:
# cat /sys/kernel/debug/kernel_page_tables
cat: /sys/kernel/debug/kernel_page_tables: No such file or directory

--
Sander
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.3.0-mw-20151104-linus-doflr Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32

Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 19:06, Ingo Molnar wrote:

* Stephen Smalley  wrote:


On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:
>Hi All,
>
>I just tried to boot with the current linus mergewindow tree under Xen.
>It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
>option enabled.
>Disabling it makes the kernel boot fine.
>
>The splat:
>[   18.424241] Freeing unused kernel memory: 1104K (822fc000 -
>8241)
>[   18.430314] Write protecting the kernel read-only data: 18432k
>[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 -
>880001c0)
>[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 -
>88000220)
>[   18.453947] BUG: unable to handle kernel paging request at
>88055c883000
>[   18.459943] IP: []
>ptdump_walk_pgd_level_core+0x20e/0x440
>[   18.465847] PGD 2212067 PUD 0
>[   18.471564] Oops:  [#1] SMP
>[   18.477248] Modules linked in:
>[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
>4.3.0-mw-20151104-linus-doflr+ #1
>[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS
>V1.8B1 09/13/2010
>[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
>880059b98000
>[   18.500852] RIP: e030:[]  []
>ptdump_walk_pgd_level_core+0x20e/0x440


It would be nice to see which line of code this corresponds to. Doing 
this:


  gdb vmlinux
  list *0x8105af8e

should normally do the trick.

Thanks,

Ingo


Hi Ingo,

(gdb) list *0x8105af8e
0x8105af8e is in ptdump_walk_pgd_level_core 
(arch/x86/mm/dump_pagetables.c:181).

warning: Source file is more recent than executable.
176  * On 64 bits, sign-extend the 48 bit address to 64 bit
177  */
178 static unsigned long normalize_addr(unsigned long u)
179 {
180 #ifdef CONFIG_X86_64
181 return (signed long)(u << 16) >> 16;
182 #else
183 return u;
184 #endif
185 }

--
Sander




Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

Hi All,

I just tried to boot with the current linus mergewindow tree under Xen.
It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX" option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 - 8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 - 880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 - 88000220)
[   18.453947] BUG: unable to handle kernel paging request at 88055c883000
[   18.459943] IP: [] ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti: 880059b98000
[   18.500852] RIP: e030:[]  [] ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX: 8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI: 88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:
[   18.532522] R10: 88005fd0e790 R11: 0001 R12: 88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:
[   18.545247] FS:  () GS:88005f68() knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4: 0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000 0800
[   18.577704]  8000 88055c883000 0007 88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001
[   18.590916] Call Trace:
[   18.597458]  [] ? free_reserved_area+0x11e/0x120
[   18.604180]  [] ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff ff 48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 ff ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 89
[   18.658246] RIP  [] ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 19:47, Stephen Smalley wrote:

On 11/04/2015 01:28 PM, Sander Eikelenboom wrote:

On 2015-11-04 16:52, Stephen Smalley wrote:

On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:

Hi All,

I just tried to boot with the current linus mergewindow tree under 
Xen.

It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 
-

8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 
-

880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 
-

88000220)
[   18.453947] BUG: unable to handle kernel paging request at
88055c883000
[   18.459943] IP: []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
880059b98000
[   18.500852] RIP: e030:[]  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX:
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI:
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:

[   18.532522] R10: 88005fd0e790 R11: 0001 R12:
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:

[   18.545247] FS:  () GS:88005f68()
knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4:
0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000
0800
[   18.577704]  8000 88055c883000 0007
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001

[   18.590916] Call Trace:
[   18.597458]  [] ? free_reserved_area+0x11e/0x120
[   18.604180]  [] ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff ff
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 ff
ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 89
[   18.658246] RIP  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



What's your .config?  Does cat /sys/kernel/debug/kernel_page_tables
produce a similar fault even with CONFIG_DEBUG_WX=n?


.config is attached

Hmm that sysfs file doesn't seem to exist then:
# cat /sys/kernel/debug/kernel_page_tables
cat: /sys/kernel/debug/kernel_page_tables: No such file or directory


Needs CONFIG_X86_PTDUMP=y.
Also assumes you have debugfs mounted there.
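
For anyone repeating this test: the file only appears when the kernel is built with CONFIG_X86_PTDUMP=y and debugfs is mounted on /sys/kernel/debug. Below is a rough sketch of checking the second condition from user space by scanning /proc/mounts. The helper names are made up for illustration, and mount points containing escaped characters are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if one /proc/mounts-style line
 * describes a debugfs mount at `where`, 0 otherwise. */
int line_is_debugfs_at(const char *line, const char *where)
{
	char dev[256], dir[256], type[64];

	/* Line format: device mountpoint fstype options dump pass */
	if (sscanf(line, "%255s %255s %63s", dev, dir, type) != 3)
		return 0;
	return strcmp(type, "debugfs") == 0 && strcmp(dir, where) == 0;
}

/* Hypothetical helper: scans /proc/mounts. Returns 1 if debugfs is
 * mounted at `where`, 0 if not, -1 if /proc/mounts cannot be opened. */
int debugfs_mounted_at(const char *where)
{
	char line[1024];
	int found = 0;
	FILE *f = fopen("/proc/mounts", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (line_is_debugfs_at(line, where))
			found = 1;
	fclose(f);
	return found;
}
```

If debugfs_mounted_at("/sys/kernel/debug") returns 0, mounting debugfs there should make kernel_page_tables visible, provided the config option is set.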


Recompiled, and the result is that it also blows up:
[  902.389247] BUG: unable to handle kernel paging request at 
88055c883000
[  902.402749] IP: [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[  902.416261] PGD 2212067 PUD 0
[  902.427768] Oops:  [#1] SMP
[  902.438137] Modules linked in:
[  902.448299] CPU: 2 PID: 21951 Comm: cat Not tainted 
4.3.0-mw-20151104-linus-doflr-nodebugwx-withptdump+ #1
[  902.458581] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
V1.8B1 09/13/2010
[  902.468850] task: 88004b49e300 ti: 88005928c000 task.ti: 
88005928c000
[  902.479133] RIP: e030:[]  [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[  902.489536] RSP: e02b:88005928fd20  EFLAGS: 00010296
[  902.499692] RAX: 88055c883000 RBX:  RCX: 
8800
[  902.509755] RDX: 0067 RSI: 88005928fd70 RDI: 
88001000
[  902.519680] RBP: 88005928fdd8 R08: 1000 R09: 

[  902.529555] R10:  R11: 0246 R12: 
88005928ff20
[  902.539349] R13: cfff R14: 88005928fd70 R15: 
880033c773c0
[  902.549081] FS:  7f56b07d4700() GS:88005f68() 
knlGS:

[  902.558690] CS:  e033 DS:  ES:  CR0: 8005003b
[  902.568111] CR2: 88055c883000 CR3: 4563f000 CR4: 
0660

[  902.577508] Stac

Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

Hi All,

I just tried to boot with the current linus mergewindow tree under Xen.
It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX" 
option enabled.

Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 - 
8241)

[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 - 
880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 - 
88000220)
[   18.453947] BUG: unable to handle kernel paging request at 
88055c883000
[   18.459943] IP: [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti: 
880059b98000
[   18.500852] RIP: e030:[]  [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX: 
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI: 
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09: 

[   18.532522] R10: 88005fd0e790 R11: 0001 R12: 
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15: 

[   18.545247] FS:  () GS:88005f68() 
knlGS:

[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4: 
0660

[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000 
0800
[   18.577704]  8000 88055c883000 0007 
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001 


[   18.590916] Call Trace:
[   18.597458]  [] ? free_reserved_area+0x11e/0x120
[   18.604180]  [] 
ptdump_walk_pgd_level_checkwx+0x12/0x20

[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff ff 
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 ff 
ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 89
[   18.658246] RIP  [] 
ptdump_walk_pgd_level_core+0x20e/0x440

[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0009

[   18.685641]
[   18.699520] Kernel Offset: disable



Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 19:06, Ingo Molnar wrote:

* Stephen Smalley <s...@tycho.nsa.gov> wrote:


On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:
>Hi All,
>
>I just tried to boot with the current linus mergewindow tree under Xen.
>It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
>option enabled.
>Disabling it makes the kernel boot fine.
>
>The splat:
>[   18.424241] Freeing unused kernel memory: 1104K (822fc000 -
>8241)
>[   18.430314] Write protecting the kernel read-only data: 18432k
>[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 -
>880001c0)
>[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 -
>88000220)
>[   18.453947] BUG: unable to handle kernel paging request at
>88055c883000
>[   18.459943] IP: []
>ptdump_walk_pgd_level_core+0x20e/0x440
>[   18.465847] PGD 2212067 PUD 0
>[   18.471564] Oops:  [#1] SMP
>[   18.477248] Modules linked in:
>[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
>4.3.0-mw-20151104-linus-doflr+ #1
>[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS
>V1.8B1 09/13/2010
>[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
>880059b98000
>[   18.500852] RIP: e030:[]  []
>ptdump_walk_pgd_level_core+0x20e/0x440


It would be nice to see which line of code this corresponds to. Doing 
this:


  gdb vmlinux
  list *0x8105af8e

should normally do the trick.

Thanks,

Ingo


Hi Ingo,

(gdb) list *0x8105af8e
0x8105af8e is in ptdump_walk_pgd_level_core 
(arch/x86/mm/dump_pagetables.c:181).

warning: Source file is more recent than executable.
176  * On 64 bits, sign-extend the 48 bit address to 64 bit
177  */
178 static unsigned long normalize_addr(unsigned long u)
179 {
180 #ifdef CONFIG_X86_64
181 return (signed long)(u << 16) >> 16;
182 #else
183 return u;
184 #endif
185 }
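
As a side note on the helper gdb points at: normalize_addr() copies bit 47 into bits 48..63, so a 48-bit virtual address becomes a canonical 64-bit one. Here is a small user-space sketch of the same shift trick. It assumes the compiler implements right shift of a negative signed value as an arithmetic shift (GCC and Clang do on x86-64, but the C standard leaves it implementation-defined). The name sign_extend_48 and the sample addresses are mine, not from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space copy of the shift trick in normalize_addr(). */
uint64_t sign_extend_48(uint64_t u)
{
	/* Move bit 47 up to bit 63, then shift back arithmetically so
	 * the upper 16 bits are filled with copies of that bit. */
	return (uint64_t)((int64_t)(u << 16) >> 16);
}

/* Small self-check with two illustrative addresses. */
void sign_extend_48_check(void)
{
	/* Bit 47 set: upper bits become all ones. */
	assert(sign_extend_48(0x0000800000000000ULL) == 0xffff800000000000ULL);
	/* Bit 47 clear: value is unchanged. */
	assert(sign_extend_48(0x00007fffffffffffULL) == 0x00007fffffffffffULL);
}
```

This only illustrates what the helper computes. It does not say which statement in ptdump_walk_pgd_level_core actually faulted, since the helper is inlined there.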

--
Sander




Re: Linux 4.4 MW: Boot under Xen fails with CONFIG_DEBUG_WX enabled: RIP: ptdump_walk_pgd_level_core

2015-11-04 Thread Sander Eikelenboom

On 2015-11-04 16:52, Stephen Smalley wrote:

On 11/04/2015 06:55 AM, Sander Eikelenboom wrote:

Hi All,

I just tried to boot with the current linus mergewindow tree under Xen.
It fails with a kernel panic at boot with the new "CONFIG_DEBUG_WX"
option enabled.
Disabling it makes the kernel boot fine.

The splat:
[   18.424241] Freeing unused kernel memory: 1104K (822fc000 -
8241)
[   18.430314] Write protecting the kernel read-only data: 18432k
[   18.441054] Freeing unused kernel memory: 1144K (880001ae2000 -
880001c0)
[   18.447966] Freeing unused kernel memory: 1560K (88000207a000 -
88000220)
[   18.453947] BUG: unable to handle kernel paging request at
88055c883000
[   18.459943] IP: []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.465847] PGD 2212067 PUD 0
[   18.471564] Oops:  [#1] SMP
[   18.477248] Modules linked in:
[   18.482918] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
4.3.0-mw-20151104-linus-doflr+ #1
[   18.488804] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[   18.494778] task: 880059b9 ti: 880059b98000 task.ti:
880059b98000
[   18.500852] RIP: e030:[]  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.507102] RSP: e02b:880059b9be48  EFLAGS: 00010296
[   18.513351] RAX: 88055c883000 RBX: 81ae2000 RCX:
8800
[   18.519733] RDX: 0067 RSI: 880059b9be98 RDI:
88001000
[   18.526129] RBP: 880059b9bf00 R08:  R09:

[   18.532522] R10: 88005fd0e790 R11: 0001 R12:
88008000
[   18.538891] R13: cfff R14: 880059b9be98 R15:

[   18.545247] FS:  () GS:88005f68()
knlGS:
[   18.551708] CS:  e033 DS:  ES:  CR0: 8005003b
[   18.558153] CR2: 88055c883000 CR3: 02211000 CR4:
0660
[   18.564686] Stack:
[   18.571106]  000159b9be50 82211000 88055c884000
0800
[   18.577704]  8000 88055c883000 0007
88005fd0e790
[   18.584291]  880059b9bed8 81156ace 0001

[   18.590916] Call Trace:
[   18.597458]  [] ? free_reserved_area+0x11e/0x120
[   18.604180]  []
ptdump_walk_pgd_level_checkwx+0x12/0x20
[   18.611014]  [] mark_rodata_ro+0xe9/0xf0
[   18.617819]  [] ? rest_init+0x80/0x80
[   18.624512]  [] kernel_init+0x18/0xe0
[   18.631095]  [] ret_from_fork+0x3f/0x70
[   18.637650]  [] ? rest_init+0x80/0x80
[   18.644178] Code: 70 ff ff ff 48 3b 85 58 ff ff ff 0f 84 c0 fe ff ff
48 8b 85 68 ff ff ff 48 c1 e0 10 48 c1 f8 10 48 89 45 b0 48 8b 85 70 ff
ff ff <48> 8b 38 48 85 ff 0f 85 4e ff ff ff b9 02 00 00 00 31 d2 4c 89
[   18.658246] RIP  []
ptdump_walk_pgd_level_core+0x20e/0x440
[   18.665211]  RSP 
[   18.672073] CR2: 88055c883000
[   18.678852] ---[ end trace d84e34461c40637a ]---
[   18.685641] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0009
[   18.685641]
[   18.699520] Kernel Offset: disable



What's your .config?  Does cat /sys/kernel/debug/kernel_page_tables
produce a similar fault even with CONFIG_DEBUG_WX=n?


.config is attached

Hmm that sysfs file doesn't seem to exist then:
# cat /sys/kernel/debug/kernel_page_tables
cat: /sys/kernel/debug/kernel_page_tables: No such file or directory

--
Sander
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.3.0-mw-20151104-linus-doflr Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32

Re: Linux 4.2-rc6 regression: RIP: e030:[] [] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

On 2015-08-17 19:18, Eric Dumazet wrote:

From: Eric Dumazet 

On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:

Monday, August 17, 2015, 4:21:47 PM, you wrote:

> On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
>> This is very similar to the behavior I am seeing in this bug:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=102911

> OK, but have you applied the fix ?

> 
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

> It will be part of net iteration from David Miller to Linus Torvalds.


I did have that patch in for my last report.
But I don't think he had (looking at the second part of his oops).



Then can you try following fix as well ?

Thanks !


Running now :)




[PATCH] timer: fix a race in __mod_timer()

lock_timer_base() can not catch following :

CPU1 (in __mod_timer())
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                       return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if 
disabled")

Signed-off-by: Eric Dumazet 
Cc: Thomas Gleixner 
---
 kernel/time/timer.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa9faf7..84190f02b521 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}
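
To spell out what the patch changes: the old code updated timer->flags with two stores, so a CPU reading the flags between them could see the base bits cleared, which looks like CPU 0. The fix computes the final value first and stores it once. Below is a single-threaded user-space sketch of that difference. DEMO_BASEMASK, the observer callback and the sample values are hypothetical stand-ins, not the kernel's definitions, and a plain store here only models what WRITE_ONCE() guarantees in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask: the low bits hold the CPU number. */
#define DEMO_BASEMASK 0x000fffffu

/* Called after every store, to model what another CPU could observe. */
typedef void (*observe_fn)(uint32_t flags, void *ctx);

/* Old scheme: two stores, so the intermediate value is observable. */
void set_cpu_two_stores(uint32_t *flags, uint32_t cpu,
			observe_fn obs, void *ctx)
{
	*flags &= ~DEMO_BASEMASK;
	obs(*flags, ctx);	/* CPU field is 0 here */
	*flags |= cpu;
	obs(*flags, ctx);
}

/* Patched scheme: compose the new value, then store it once. */
void set_cpu_one_store(uint32_t *flags, uint32_t cpu,
		       observe_fn obs, void *ctx)
{
	uint32_t v = (*flags & ~DEMO_BASEMASK) | cpu;

	*flags = v;
	obs(*flags, ctx);
}

/* Observer that records whether the CPU field was ever seen as 0. */
void saw_cpu0(uint32_t flags, void *ctx)
{
	if ((flags & DEMO_BASEMASK) == 0)
		*(int *)ctx = 1;
}
```

Moving a timer from CPU 2 to CPU 3 with set_cpu_two_stores() lets the observer see CPU 0 in between. With set_cpu_one_store() it only ever sees the final value.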



Re: Linux 4.2-rc6 regression: RIP: e030:[] [] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Monday, August 17, 2015, 4:21:47 PM, you wrote:

> On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
>> This is very similar to the behavior I am seeing in this bug:
>> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=102911

> OK, but have you applied the fix ?

> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

> It will be part of net iteration from David Miller to Linus Torvalds.


I did have that patch in for my last report.
But I don't think he had (looking at the second part of his oops).
 
--
Sander



Re: Linux 4.2-rc6 regression: RIP: e030:[] [] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Monday, August 17, 2015, 3:37:13 PM, you wrote:

> On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
>> Saturday, August 15, 2015, 12:39:25 AM, you wrote:
>> 
>> > On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
>> >> On 2015-08-13 00:41, Eric Dumazet wrote:
>> >> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>> >> > 
>> >> >> Thanks for the reminder, but luckily i was aware of that,
>> >> >> seen enough of your replies asking for patches to be resubmitted
>> >> >> against "the other tree" ;)
>> >> >> Kernel with patch is currently running so fingers crossed.
>> >> > 
>> >> > Thanks for testing. I am definitely interested knowing your results.
>> >> 
>> >> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
>> >> breaking things
>> >> (have to test if a revert helps) i get this in some guests:
>> 
>> 
>> > Yes, this was fixed by :
>> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>> 
>> 
>> Hi Eric,
>> 
>> With that patch i had a crash again this night, see below.
>> 
>> --
>> Sander
>> 
>> [177459.188808] general protection fault:  [#1] SMP 
>> [177459.199746] Modules linked in:
>> [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 4.2.0-rc6-20150815-linus-doflr-net+ #1
>> [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>> V1.8B1 09/13/2010
>> [177459.232247] task: 8221a580 ti: 8220 task.ti: 
>> 8220
>> [177459.242931] RIP: e030:[]  [] 
>> detach_if_pending+0x18/0x80
>> [177459.253503] RSP: e02b:88005f6039d8  EFLAGS: 00010086
>> [177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: 
>> dead00200200
>> [177459.274599] RDX:  RSI: 88005f60e5c0 RDI: 
>> 880004901420
>> [177459.285122] RBP: 88005f6039d8 R08: 0001 R09: 
>> 
>> [177459.295286] R10: 0003 R11: 880004901394 R12: 
>> 0003
>> [177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 
>> 88005f60e5c0
>> [177459.315345] FS:  7f51317ec700() GS:88005f60() 
>> knlGS:
>> [177459.325340] CS:  e033 DS:  ES:  CR0: 8005003b
>> [177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 
>> 0660
>> [177459.345129] Stack:
>> [177459.354783]  88005f603a28 8110ee7f 810fb261 
>> 0200
>> [177459.364505]  0003 880004901380 0003 
>> 8800567d0d00
>> [177459.374064]  07b98a00  88005f603a58 
>> 819b3eb3
>> [177459.383532] Call Trace:
>> [177459.392878]   
>> [177459.392935]  [] mod_timer_pending+0x3f/0xe0
>> [177459.411058]  [] ? 
>> __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
>> [177459.419876]  [] __nf_ct_refresh_acct+0xa3/0xb0
>> [177459.428642]  [] tcp_packet+0xb3b/0x1290
>> [177459.437285]  [] ? ip_output+0x5e/0xc0
>> [177459.445845]  [] ? __local_bh_enable_ip+0x2a/0x90
>> [177459.454331]  [] ? __nf_conntrack_find_get+0x129/0x2a0
>> [177459.462642]  [] nf_conntrack_in+0x29c/0x7c0
>> [177459.470711]  [] ipv4_conntrack_local+0x4c/0x50
>> [177459.478753]  [] nf_iterate+0x4c/0x80
>> [177459.486726]  [] ? generic_handle_irq+0x27/0x40
>> [177459.494634]  [] nf_hook_slow+0x64/0xc0
>> [177459.502486]  [] __ip_local_out_sk+0x90/0xa0
>> [177459.510248]  [] ? ip_forward_options+0x1a0/0x1a0
>> [177459.517782]  [] ip_local_out_sk+0x16/0x40
>> [177459.525044]  [] ip_queue_xmit+0x14d/0x350
>> [177459.532247]  [] tcp_transmit_skb+0x48e/0x960
>> [177459.539413]  [] tcp_xmit_probe_skb+0xdb/0xf0
>> [177459.546389]  [] tcp_write_wakeup+0x5b/0x150
>> [177459.553061]  [] tcp_keepalive_timer+0x1fb/0x230
>> [177459.559761]  [] ? tcp_init_xmit_timers+0x20/0x20
>> [177459.566447]  [] call_timer_fn.isra.27+0x17/0x80
>> [177459.573121]  [] ? tcp_init_xmit_timers+0x20/0x20
>> [177459.579778]  [] run_timer_softirq+0x12d/0x200
>> [177459.586448]  [] __do_softirq+0x103/0x210
>> [177459.593138]  [] irq_exit+0x4b/0xa0
>> [177459.599783]  [] xen_evtchn_do_upcall+0x34/0x50
>> [177459.606300]  [] xen_do_hypervisor_callback+0x1e/0x40
>> [177459.612583]   
>> [177459.612637]  [] ? xen_hypercall_sched_op+0xa/0x20
>> [177459.62

Re: Linux 4.2-rc6 regression: RIP: e030:[] [] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Saturday, August 15, 2015, 12:39:25 AM, you wrote:

> On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
>> On 2015-08-13 00:41, Eric Dumazet wrote:
>> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>> > 
>> >> Thanks for the reminder, but luckily i was aware of that,
>> >> seen enough of your replies asking for patches to be resubmitted
>> >> against "the other tree" ;)
>> >> Kernel with patch is currently running so fingers crossed.
>> > 
>> > Thanks for testing. I am definitely interested knowing your results.
>> 
>> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
>> breaking things
>> (have to test if a revert helps) i get this in some guests:


> Yes, this was fixed by :
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af


Hi Eric,

With that patch i had a crash again this night, see below.

--
Sander

[177459.188808] general protection fault:  [#1] SMP 
[177459.199746] Modules linked in:
[177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.2.0-rc6-20150815-linus-doflr-net+ #1
[177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[177459.232247] task: 8221a580 ti: 8220 task.ti: 
8220
[177459.242931] RIP: e030:[]  [] 
detach_if_pending+0x18/0x80
[177459.253503] RSP: e02b:88005f6039d8  EFLAGS: 00010086
[177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: 
dead00200200
[177459.274599] RDX:  RSI: 88005f60e5c0 RDI: 
880004901420
[177459.285122] RBP: 88005f6039d8 R08: 0001 R09: 

[177459.295286] R10: 0003 R11: 880004901394 R12: 
0003
[177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 
88005f60e5c0
[177459.315345] FS:  7f51317ec700() GS:88005f60() 
knlGS:
[177459.325340] CS:  e033 DS:  ES:  CR0: 8005003b
[177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 
0660
[177459.345129] Stack:
[177459.354783]  88005f603a28 8110ee7f 810fb261 
0200
[177459.364505]  0003 880004901380 0003 
8800567d0d00
[177459.374064]  07b98a00  88005f603a58 
819b3eb3
[177459.383532] Call Trace:
[177459.392878]   
[177459.392935]  [] mod_timer_pending+0x3f/0xe0
[177459.411058]  [] ? 
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[177459.419876]  [] __nf_ct_refresh_acct+0xa3/0xb0
[177459.428642]  [] tcp_packet+0xb3b/0x1290
[177459.437285]  [] ? ip_output+0x5e/0xc0
[177459.445845]  [] ? __local_bh_enable_ip+0x2a/0x90
[177459.454331]  [] ? __nf_conntrack_find_get+0x129/0x2a0
[177459.462642]  [] nf_conntrack_in+0x29c/0x7c0
[177459.470711]  [] ipv4_conntrack_local+0x4c/0x50
[177459.478753]  [] nf_iterate+0x4c/0x80
[177459.486726]  [] ? generic_handle_irq+0x27/0x40
[177459.494634]  [] nf_hook_slow+0x64/0xc0
[177459.502486]  [] __ip_local_out_sk+0x90/0xa0
[177459.510248]  [] ? ip_forward_options+0x1a0/0x1a0
[177459.517782]  [] ip_local_out_sk+0x16/0x40
[177459.525044]  [] ip_queue_xmit+0x14d/0x350
[177459.532247]  [] tcp_transmit_skb+0x48e/0x960
[177459.539413]  [] tcp_xmit_probe_skb+0xdb/0xf0
[177459.546389]  [] tcp_write_wakeup+0x5b/0x150
[177459.553061]  [] tcp_keepalive_timer+0x1fb/0x230
[177459.559761]  [] ? tcp_init_xmit_timers+0x20/0x20
[177459.566447]  [] call_timer_fn.isra.27+0x17/0x80
[177459.573121]  [] ? tcp_init_xmit_timers+0x20/0x20
[177459.579778]  [] run_timer_softirq+0x12d/0x200
[177459.586448]  [] __do_softirq+0x103/0x210
[177459.593138]  [] irq_exit+0x4b/0xa0
[177459.599783]  [] xen_evtchn_do_upcall+0x34/0x50
[177459.606300]  [] xen_do_hypervisor_callback+0x1e/0x40
[177459.612583]   
[177459.612637]  [] ? xen_hypercall_sched_op+0xa/0x20
[177459.625010]  [] ? xen_hypercall_sched_op+0xa/0x20
[177459.631157]  [] ? xen_safe_halt+0x10/0x20
[177459.637158]  [] ? default_idle+0x13/0x20
[177459.643072]  [] ? arch_cpu_idle+0xa/0x10
[177459.648809]  [] ? default_idle_call+0x2e/0x50
[177459.654650]  [] ? cpu_startup_entry+0x272/0x2e0
[177459.660488]  [] ? rest_init+0x77/0x80
[177459.666297]  [] ? start_kernel+0x43b/0x448
[177459.672092]  [] ? x86_64_start_reservations+0x2a/0x2c
[177459.677800]  [] ? xen_start_kernel+0x550/0x55c
[177459.683451] Code: 77 28 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 
48 8b 47 08 55 48 89 e5 48 85 c0 74 6a 48 8b 0f 48 85 c9 48 89 08 74 04 <48> 89 
41 08 84 d2 74 08 48 c7 47 08 00 00 00 00 f6 47 2a 10 48 
[177459.695332] RIP  [] detach_if_pending+0x18/0x80
[177459.701154]  RSP 
(XEN) [2015-08-17 00:11:51.426] Hardware Dom0 crashed: rebooting machine in 5 
seconds.



Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Saturday, August 15, 2015, 12:39:25 AM, you wrote:

 On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
 On 2015-08-13 00:41, Eric Dumazet wrote:
  On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
  
  Thanks for the reminder, but luckily i was aware of that,
  seen enough of your replies asking for patches to be resubmitted
  against the other tree ;)
  Kernel with patch is currently running so fingers crossed.
  
  Thanks for testing. I am definitely interested knowing your results.
 
 Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
 breaking things
 (have to test if a revert helps) i get this in some guests:


 Yes, this was fixed by :
 http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af


Hi Eric,

With that patch i had a crash again this night, see below.

--
Sander

[177459.188808] general protection fault:  [#1] SMP 
[177459.199746] Modules linked in:
[177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.2.0-rc6-20150815-linus-doflr-net+ #1
[177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[177459.232247] task: 8221a580 ti: 8220 task.ti: 
8220
[177459.242931] RIP: e030:[8110eb58]  [8110eb58] 
detach_if_pending+0x18/0x80
[177459.253503] RSP: e02b:88005f6039d8  EFLAGS: 00010086
[177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: 
dead00200200
[177459.274599] RDX:  RSI: 88005f60e5c0 RDI: 
880004901420
[177459.285122] RBP: 88005f6039d8 R08: 0001 R09: 

[177459.295286] R10: 0003 R11: 880004901394 R12: 
0003
[177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 
88005f60e5c0
[177459.315345] FS:  7f51317ec700() GS:88005f60() 
knlGS:
[177459.325340] CS:  e033 DS:  ES:  CR0: 8005003b
[177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 
0660
[177459.345129] Stack:
[177459.354783]  88005f603a28 8110ee7f 810fb261 
0200
[177459.364505]  0003 880004901380 0003 
8800567d0d00
[177459.374064]  07b98a00  88005f603a58 
819b3eb3
[177459.383532] Call Trace:
[177459.392878]  IRQ 
[177459.392935]  [8110ee7f] mod_timer_pending+0x3f/0xe0
[177459.411058]  [810fb261] ? 
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[177459.419876]  [819b3eb3] __nf_ct_refresh_acct+0xa3/0xb0
[177459.428642]  [819baafb] tcp_packet+0xb3b/0x1290
[177459.437285]  [81a2535e] ? ip_output+0x5e/0xc0
[177459.445845]  [810ca8ca] ? __local_bh_enable_ip+0x2a/0x90
[177459.454331]  [819b35a9] ? __nf_conntrack_find_get+0x129/0x2a0
[177459.462642]  [819b549c] nf_conntrack_in+0x29c/0x7c0
[177459.470711]  [81a65e9c] ipv4_conntrack_local+0x4c/0x50
[177459.478753]  [819ad67c] nf_iterate+0x4c/0x80
[177459.486726]  [81102437] ? generic_handle_irq+0x27/0x40
[177459.494634]  [819ad714] nf_hook_slow+0x64/0xc0
[177459.502486]  [81a22d40] __ip_local_out_sk+0x90/0xa0
[177459.510248]  [81a22c40] ? ip_forward_options+0x1a0/0x1a0
[177459.517782]  [81a22d66] ip_local_out_sk+0x16/0x40
[177459.525044]  [81a2343d] ip_queue_xmit+0x14d/0x350
[177459.532247]  [81a3ae7e] tcp_transmit_skb+0x48e/0x960
[177459.539413]  [81a3cddb] tcp_xmit_probe_skb+0xdb/0xf0
[177459.546389]  [81a3dffb] tcp_write_wakeup+0x5b/0x150
[177459.553061]  [81a3e51b] tcp_keepalive_timer+0x1fb/0x230
[177459.559761]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
[177459.566447]  [8110f3c7] call_timer_fn.isra.27+0x17/0x80
[177459.573121]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
[177459.579778]  [8110f55d] run_timer_softirq+0x12d/0x200
[177459.586448]  [810ca6c3] __do_softirq+0x103/0x210
[177459.593138]  [810ca9cb] irq_exit+0x4b/0xa0
[177459.599783]  [814f05d4] xen_evtchn_do_upcall+0x34/0x50
[177459.606300]  [81af93ae] xen_do_hypervisor_callback+0x1e/0x40
[177459.612583]  EOI 
[177459.612637]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[177459.625010]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[177459.631157]  [81008d60] ? xen_safe_halt+0x10/0x20
[177459.637158]  [810188d3] ? default_idle+0x13/0x20
[177459.643072]  [81018e1a] ? arch_cpu_idle+0xa/0x10
[177459.648809]  [810f8e7e] ? default_idle_call+0x2e/0x50
[177459.654650]  [810f9112] ? cpu_startup_entry+0x272/0x2e0
[177459.660488]  [81ae79f7] ? rest_init+0x77/0x80
[177459.666297]  [82312f58] ? start_kernel+0x43b/0x448
[177459.672092]  [823124ef] ? x86_64_start_reservations+0x2a/0x2c
[177459.677800]  [82316008] ? xen_start_kernel+0x550/0x55c
[177459.683451

Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Monday, August 17, 2015, 3:37:13 PM, you wrote:

 On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
 Saturday, August 15, 2015, 12:39:25 AM, you wrote:
 
  On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
  On 2015-08-13 00:41, Eric Dumazet wrote:
   On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
   
   Thanks for the reminder, but luckily i was aware of that,
   seen enough of your replies asking for patches to be resubmitted
   against the other tree ;)
   Kernel with patch is currently running so fingers crossed.
   
   Thanks for testing. I am definitely interested knowing your results.
  
  Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
  breaking things
  (have to test if a revert helps) i get this in some guests:
 
 
  Yes, this was fixed by :
  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
 
 
 Hi Eric,
 
 With that patch i had a crash again this night, see below.
 
 --
 Sander
 
 [177459.188808] general protection fault:  [#1] SMP 
 [177459.199746] Modules linked in:
 [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150815-linus-doflr-net+ #1
 [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
 [177459.232247] task: 8221a580 ti: 8220 task.ti: 8220
 [177459.242931] RIP: e030:[8110eb58]  [8110eb58] detach_if_pending+0x18/0x80
 [177459.253503] RSP: e02b:88005f6039d8  EFLAGS: 00010086
 [177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: dead00200200
 [177459.274599] RDX:  RSI: 88005f60e5c0 RDI: 880004901420
 [177459.285122] RBP: 88005f6039d8 R08: 0001 R09: 
 [177459.295286] R10: 0003 R11: 880004901394 R12: 0003
 [177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 88005f60e5c0
 [177459.315345] FS:  7f51317ec700() GS:88005f60() knlGS:
 [177459.325340] CS:  e033 DS:  ES:  CR0: 8005003b
 [177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 0660
 [177459.345129] Stack:
 [177459.354783]  88005f603a28 8110ee7f 810fb261 0200
 [177459.364505]  0003 880004901380 0003 8800567d0d00
 [177459.374064]  07b98a00  88005f603a58 819b3eb3
 [177459.383532] Call Trace:
 [177459.392878]  <IRQ> 
 [177459.392935]  [8110ee7f] mod_timer_pending+0x3f/0xe0
 [177459.411058]  [810fb261] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [177459.419876]  [819b3eb3] __nf_ct_refresh_acct+0xa3/0xb0
 [177459.428642]  [819baafb] tcp_packet+0xb3b/0x1290
 [177459.437285]  [81a2535e] ? ip_output+0x5e/0xc0
 [177459.445845]  [810ca8ca] ? __local_bh_enable_ip+0x2a/0x90
 [177459.454331]  [819b35a9] ? __nf_conntrack_find_get+0x129/0x2a0
 [177459.462642]  [819b549c] nf_conntrack_in+0x29c/0x7c0
 [177459.470711]  [81a65e9c] ipv4_conntrack_local+0x4c/0x50
 [177459.478753]  [819ad67c] nf_iterate+0x4c/0x80
 [177459.486726]  [81102437] ? generic_handle_irq+0x27/0x40
 [177459.494634]  [819ad714] nf_hook_slow+0x64/0xc0
 [177459.502486]  [81a22d40] __ip_local_out_sk+0x90/0xa0
 [177459.510248]  [81a22c40] ? ip_forward_options+0x1a0/0x1a0
 [177459.517782]  [81a22d66] ip_local_out_sk+0x16/0x40
 [177459.525044]  [81a2343d] ip_queue_xmit+0x14d/0x350
 [177459.532247]  [81a3ae7e] tcp_transmit_skb+0x48e/0x960
 [177459.539413]  [81a3cddb] tcp_xmit_probe_skb+0xdb/0xf0
 [177459.546389]  [81a3dffb] tcp_write_wakeup+0x5b/0x150
 [177459.553061]  [81a3e51b] tcp_keepalive_timer+0x1fb/0x230
 [177459.559761]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
 [177459.566447]  [8110f3c7] call_timer_fn.isra.27+0x17/0x80
 [177459.573121]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
 [177459.579778]  [8110f55d] run_timer_softirq+0x12d/0x200
 [177459.586448]  [810ca6c3] __do_softirq+0x103/0x210
 [177459.593138]  [810ca9cb] irq_exit+0x4b/0xa0
 [177459.599783]  [814f05d4] xen_evtchn_do_upcall+0x34/0x50
 [177459.606300]  [81af93ae] xen_do_hypervisor_callback+0x1e/0x40
 [177459.612583]  <EOI> 
 [177459.612637]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [177459.625010]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [177459.631157]  [81008d60] ? xen_safe_halt+0x10/0x20
 [177459.637158]  [810188d3] ? default_idle+0x13/0x20
 [177459.643072]  [81018e1a] ? arch_cpu_idle+0xa/0x10
 [177459.648809]  [810f8e7e] ? default_idle_call+0x2e/0x50
 [177459.654650]  [810f9112] ? cpu_startup_entry+0x272/0x2e0
 [177459.660488]  [81ae79f7] ? rest_init+0x77/0x80

Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

Monday, August 17, 2015, 4:21:47 PM, you wrote:

 On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
 This is very similar to the behavior I am seeing in this bug:
 
 https://bugzilla.kernel.org/show_bug.cgi?id=102911

 OK, but have you applied the fix ?

 http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

 It will be part of net iteration from David Miller to Linus Torvalds.


I did have that patch in for my last report.
But i don't think he had (looking at the second part of his oops).
 
--
Sander

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Sander Eikelenboom

On 2015-08-17 19:18, Eric Dumazet wrote:

From: Eric Dumazet eduma...@google.com

On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:

Monday, August 17, 2015, 4:21:47 PM, you wrote:

 On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
 This is very similar to the behavior I am seeing in this bug:

 https://bugzilla.kernel.org/show_bug.cgi?id=102911

 OK, but have you applied the fix ?

 
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

 It will be part of net iteration from David Miller to Linus Torvalds.


I did have that patch in for my last report.
But i don't think he had (looking at the second part of his oops).



Then can you try following fix as well ?

Thanks !


Running now :)




[PATCH] timer: fix a race in __mod_timer()

lock_timer_base() can not catch following :

CPU1 ( in __mod_timer()
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                       return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")

Signed-off-by: Eric Dumazet eduma...@google.com
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/timer.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa9faf7..84190f02b521 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}
 
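
For readers following the race described in the patch above, here is a
minimal user-space sketch of why the two-store sequence is a problem. It is
not kernel code: the store log, the helper names (store_log,
set_base_two_stores, set_base_one_store, exposes_stale_base) and the mask
values are assumptions made for illustration only. It records every value
written to flags and checks whether any of them looks like a non-migrating
timer on the wrong CPU, which is the state a concurrent lock_timer_base()
could be fooled by.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mask values; assumed for this sketch, not quoted from the thread. */
#define TIMER_CPUMASK   0x0007FFFFu
#define TIMER_MIGRATING 0x00080000u
#define TIMER_BASEMASK  (TIMER_CPUMASK | TIMER_MIGRATING)

/* Hypothetical helper: remembers each value stored to flags, in order. */
struct store_log {
	uint32_t values[4];
	int count;
};

/* Perform one store to *flags and record it, as another CPU could observe it. */
static void store(struct store_log *log, uint32_t *flags, uint32_t v)
{
	assert(log->count < 4);
	*flags = v;
	log->values[log->count++] = v;
}

/* Pre-patch sequence: clear the base bits, then set the new CPU (two stores). */
static void set_base_two_stores(struct store_log *log, uint32_t *flags, uint32_t cpu)
{
	store(log, flags, *flags & ~TIMER_BASEMASK);
	store(log, flags, *flags | cpu);
}

/* Patched sequence: compute the final value first, then store it once. */
static void set_base_one_store(struct store_log *log, uint32_t *flags, uint32_t cpu)
{
	store(log, flags, (*flags & ~TIMER_BASEMASK) | cpu);
}

/*
 * Returns 1 if any recorded store has TIMER_MIGRATING cleared while the CPU
 * bits do not yet name the new base, i.e. an observer would trust a stale base.
 */
static int exposes_stale_base(const struct store_log *log, uint32_t cpu)
{
	for (int i = 0; i < log->count; i++) {
		uint32_t v = log->values[i];

		if (!(v & TIMER_MIGRATING) && (v & TIMER_CPUMASK) != cpu)
			return 1;
	}
	return 0;
}
```

With flags starting as TIMER_MIGRATING (old base on cpu0) and a new base on
cpu 3, the two-store variant writes 0 before it writes 3, while the
single-store variant only ever writes 3. The sketch only models the order of
stores; it does not model what WRITE_ONCE() additionally asks of the compiler.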



Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-14 Thread Sander Eikelenboom

On 2015-08-13 00:41, Eric Dumazet wrote:

On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:


Thanks for the reminder, but luckily i was aware of that,
seen enough of your replies asking for patches to be resubmitted
against "the other tree" ;)
Kernel with patch is currently running so fingers crossed.


Thanks for testing. I am definitely interested knowing your results.


Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
breaking things

(have to test if a revert helps) i get this in some guests:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 506s! [swapper/0:0]
[ 6620.282805] Modules linked in:
[ 6620.282805] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150814-linus-doflr-apicrevert+ #1
[ 6620.282805] task: 8221a580 ti: 8220 task.ti: 8220
[ 6620.282805] RIP: e030:[8100122a]  [8100122a] xen_hypercall_xen_version+0xa/0x20
[ 6620.282805] RSP: e02b:88000fc03d48  EFLAGS: 0246
[ 6620.282805] RAX: 00040006 RBX: 0200 RCX: 8100122a
[ 6620.282805] RDX: 0001 RSI: deadbeef RDI: deadbeef
[ 6620.282805] RBP: 88000fc03d60 R08: 88000fc03ee0 R09: 00ee
[ 6620.282805] R10: 8220a0c0 R11: 0246 R12: 
[ 6620.282805] R13: 0001 R14: 880003b53054 R15: 0005
[ 6620.282805] FS:  7fec747ad800() GS:88000fc0() knlGS:
[ 6620.282805] CS:  e033 DS:  ES:  CR0: 8005003b
[ 6620.282805] CR2: 7ffcb7a7a6d8 CR3: 03164000 CR4: 0660
[ 6620.282805] Stack:
[ 6620.282805]  0068 0007 81008dbd 88000fc03dd8
[ 6620.282805]  81009592 0068 8220a0c0 00ee
[ 6620.282805]  88000fc03ee0 0200 0200 0001
[ 6620.282805] Call Trace:
[ 6620.282805]  <IRQ>
[ 6620.282805]  [81008dbd] ? xen_force_evtchn_callback+0xd/0x10
[ 6620.282805]  [81009592] check_events+0x12/0x20
[ 6620.282805]  [8100957f] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 6620.282805]  [81af79a5] ? _raw_spin_unlock_irqrestore+0x25/0x30
[ 6620.282805]  [8110ed43] try_to_del_timer_sync+0x43/0x60
[ 6620.282805]  [8110eda7] del_timer_sync+0x47/0x60
[ 6620.282805]  [81a2b698] inet_csk_reqsk_queue_drop+0x118/0x1f0
[ 6620.282805]  [81a2b8c6] reqsk_timer_handler+0x156/0x260
[ 6620.282805]  [81a2b770] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805]  [8110f3c7] call_timer_fn.isra.27+0x17/0x80
[ 6620.282805]  [81a2b770] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805]  [8110f55d] run_timer_softirq+0x12d/0x200
[ 6620.282805]  [810ca6c3] __do_softirq+0x103/0x210
[ 6620.282805]  [810ca9cb] irq_exit+0x4b/0xa0
[ 6620.282805]  [814f05d4] xen_evtchn_do_upcall+0x34/0x50
[ 6620.282805]  [81af932e] xen_do_hypervisor_callback+0x1e/0x40
[ 6620.282805]  <EOI>
[ 6620.282805]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805]  [81008d60] ? xen_safe_halt+0x10/0x20
[ 6620.282805]  [810188d3] ? default_idle+0x13/0x20
[ 6620.282805]  [81018e1a] ? arch_cpu_idle+0xa/0x10
[ 6620.282805]  [810f8e7e] ? default_idle_call+0x2e/0x50
[ 6620.282805]  [810f9112] ? cpu_startup_entry+0x272/0x2e0
[ 6620.282805]  [81ae7967] ? rest_init+0x77/0x80
[ 6620.282805]  [82312f58] ? start_kernel+0x43b/0x448
[ 6620.282805]  [823124ef] ? x86_64_start_reservations+0x2a/0x2c
[ 6620.282805]  [82316008] ? xen_start_kernel+0x550/0x55c
[ 6620.282805] Code: cc 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc




Re: Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-14 Thread Sander Eikelenboom

On 2015-08-15 00:09, Sander Eikelenboom wrote:

On 2015-08-13 00:41, Eric Dumazet wrote:

On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:


Thanks for the reminder, but luckily i was aware of that,
seen enough of your replies asking for patches to be resubmitted
against "the other tree" ;)
Kernel with patch is currently running so fingers crossed.


Thanks for testing. I am definitely interested knowing your results.


Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
breaking things
(have to test if a revert helps) i get this in some guests:


Should have done that before, because it wasn't in yet .. and likely to 
fix the issue,

also pulled and compiling now.

--
Sander




NMI watchdog: BUG: soft lockup - CPU#0 stuck for 506s! [swapper/0:0]
[ 6620.282805] Modules linked in:
[ 6620.282805] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150814-linus-doflr-apicrevert+ #1
[ 6620.282805] task: 8221a580 ti: 8220 task.ti: 8220
[ 6620.282805] RIP: e030:[8100122a]  [8100122a] xen_hypercall_xen_version+0xa/0x20
[ 6620.282805] RSP: e02b:88000fc03d48  EFLAGS: 0246
[ 6620.282805] RAX: 00040006 RBX: 0200 RCX: 8100122a
[ 6620.282805] RDX: 0001 RSI: deadbeef RDI: deadbeef
[ 6620.282805] RBP: 88000fc03d60 R08: 88000fc03ee0 R09: 00ee
[ 6620.282805] R10: 8220a0c0 R11: 0246 R12: 
[ 6620.282805] R13: 0001 R14: 880003b53054 R15: 0005
[ 6620.282805] FS:  7fec747ad800() GS:88000fc0() knlGS:
[ 6620.282805] CS:  e033 DS:  ES:  CR0: 8005003b
[ 6620.282805] CR2: 7ffcb7a7a6d8 CR3: 03164000 CR4: 0660
[ 6620.282805] Stack:
[ 6620.282805]  0068 0007 81008dbd 88000fc03dd8
[ 6620.282805]  81009592 0068 8220a0c0 00ee
[ 6620.282805]  88000fc03ee0 0200 0200 0001
[ 6620.282805] Call Trace:
[ 6620.282805]  <IRQ>
[ 6620.282805]  [81008dbd] ? xen_force_evtchn_callback+0xd/0x10
[ 6620.282805]  [81009592] check_events+0x12/0x20
[ 6620.282805]  [8100957f] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 6620.282805]  [81af79a5] ? _raw_spin_unlock_irqrestore+0x25/0x30
[ 6620.282805]  [8110ed43] try_to_del_timer_sync+0x43/0x60
[ 6620.282805]  [8110eda7] del_timer_sync+0x47/0x60
[ 6620.282805]  [81a2b698] inet_csk_reqsk_queue_drop+0x118/0x1f0
[ 6620.282805]  [81a2b8c6] reqsk_timer_handler+0x156/0x260
[ 6620.282805]  [81a2b770] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805]  [8110f3c7] call_timer_fn.isra.27+0x17/0x80
[ 6620.282805]  [81a2b770] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805]  [8110f55d] run_timer_softirq+0x12d/0x200
[ 6620.282805]  [810ca6c3] __do_softirq+0x103/0x210
[ 6620.282805]  [810ca9cb] irq_exit+0x4b/0xa0
[ 6620.282805]  [814f05d4] xen_evtchn_do_upcall+0x34/0x50
[ 6620.282805]  [81af932e] xen_do_hypervisor_callback+0x1e/0x40
[ 6620.282805]  <EOI>
[ 6620.282805]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805]  [81008d60] ? xen_safe_halt+0x10/0x20
[ 6620.282805]  [810188d3] ? default_idle+0x13/0x20
[ 6620.282805]  [81018e1a] ? arch_cpu_idle+0xa/0x10
[ 6620.282805]  [810f8e7e] ? default_idle_call+0x2e/0x50
[ 6620.282805]  [810f9112] ? cpu_startup_entry+0x272/0x2e0
[ 6620.282805]  [81ae7967] ? rest_init+0x77/0x80
[ 6620.282805]  [82312f58] ? start_kernel+0x43b/0x448
[ 6620.282805]  [823124ef] ? x86_64_start_reservations+0x2a/0x2c
[ 6620.282805]  [82316008] ? xen_start_kernel+0x550/0x55c
[ 6620.282805] Code: cc 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc


