Re: net-next boot error

2018-07-27 Thread Michael S. Tsirkin
On Thu, Jul 26, 2018 at 10:17:48AM -0400, Steven Rostedt wrote:
> 
> [ Added Thomas Gleixner ]
> 
> 
> On Thu, 26 Jul 2018 11:34:39 +0200
> Dmitry Vyukov  wrote:
> 
> > On Thu, Jul 26, 2018 at 11:29 AM, syzbot
> >  wrote:
> > > Hello,
> > >
> > > syzbot found the following crash on:
> > >
> > > HEAD commit:    dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg
> > > git tree:   net-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7
> > > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > >
> > > Unfortunately, I don't have any reproducer for this crash yet.
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com
> > >
> > > possible deadlock in static_key_slow_inc
> > > sd 0:0:1:0: [sda] Attached SCSI disk
> > > MACsec IEEE 802.1AE
> > > tun: Universal TUN/TAP device driver, 1.6
> > >
> > > 
> > > WARNING: possible recursive locking detected  
> > 
> > +Tetsuo, perhaps this boot lockdep problem then disables lockdep for
> > actual testing. I think lockdep should respect panic_on_warn.
> > 
> > 
> > > 4.18.0-rc6+ #141 Not tainted
> > > 
> > > swapper/0/1 is trying to acquire lock:
> > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> > >
> > > but task is already holding lock:
> > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > > include/linux/cpu.h:126 [inline]
> > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> > > init_vqs+0xe1a/0x1520
> > > drivers/net/virtio_net.c:2777
> 
> Here init_vqs() does:
> 
>   get_online_cpus();
>   virtnet_set_affinity(vi);
>   put_online_cpus();
> 
> This disables CPU hotplug and calls virtnet_set_affinity().
> 
> Note, get_online_cpus() is no longer recursive.
> 
> > >
> > > other info that might help us debug this:
> > >  Possible unsafe locking scenario:
> > >
> > >        CPU0
> > >
> > >   lock(cpu_hotplug_lock.rw_sem);
> > >   lock(cpu_hotplug_lock.rw_sem);
> > >
> > >  *** DEADLOCK ***
> > >
> > >  May be due to missing lock nesting notation
> > >
> > > 3 locks held by swapper/0/1:
> > >  #0: (ptrval) (&dev->mutex){}, at: device_lock
> > > include/linux/device.h:1134 [inline]
> > >  #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x15f/0x2f0
> > > drivers/base/dd.c:820
> > >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > > include/linux/cpu.h:126 [inline]
> > >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > > init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777
> > >  #2: (ptrval) (xps_map_mutex){+.+.}, at:
> > > __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278
> > >
> > > stack backtrace:
> > > CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Call Trace:
> > >  __dump_stack lib/dump_stack.c:77 [inline]
> > >  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> > >  print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
> > >  check_deadlock kernel/locking/lockdep.c:1809 [inline]
> > >  validate_chain kernel/locking/lockdep.c:2405 [inline]
> > >  __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435
> > >  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
> > >  percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline]
> > >  percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> > >  cpus_read_lock+0x43/0xa0 kernel/cpu.c:289
> > >  static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> > >  __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320
> 
> 
> __netif_set_xps_queue() calls static_key_slow_inc(), which also does a
> get_online_cpus(), and that triggers this bug.
> 
> There's a static_key_slow_inc_cpuslocked() version that should be used
> when get_online_cpus() is already taken, but I see that
> __netif_set_xps_queue() is called from several places, and I doubt it
> is always called with get_online_cpus() held. Thus just using the
> cpuslocked() version is probably not a sufficient fix.
> 
> I don't know the code enough to offer other suggestions.
> 
> -- Steve

OK, so the guess is that it's due to the combination of

commit 04157469b7b848f4a9978b63b1ea2ce62ad3a0a3
Author: Amritha Nambiar 
Date:   Fri Jun 29 21:26:46 2018 -0700

    net: Use static_key for XPS maps
 
which uses static_key_slow_inc and

commit 8af2c06ff4b144064b51b7f688194474123d9c9c
Author: Amritha Nambiar 
Date:   Fri Jun 29 21:27:07 2018 -0700

    net-sysfs: Add interface for Rx queue(s) map per Tx queue


which makes it all 

Re: net-next boot error

2018-07-26 Thread Steven Rostedt


[ Added Thomas Gleixner ]


On Thu, 26 Jul 2018 11:34:39 +0200
Dmitry Vyukov  wrote:

> On Thu, Jul 26, 2018 at 11:29 AM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com
> >
> > possible deadlock in static_key_slow_inc
> > sd 0:0:1:0: [sda] Attached SCSI disk
> > MACsec IEEE 802.1AE
> > tun: Universal TUN/TAP device driver, 1.6
> >
> > 
> > WARNING: possible recursive locking detected  
> 
> +Tetsuo, perhaps this boot lockdep problem then disables lockdep for
> actual testing. I think lockdep should respect panic_on_warn.
> 
> 
> > 4.18.0-rc6+ #141 Not tainted
> > 
> > swapper/0/1 is trying to acquire lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >
> > but task is already holding lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0xe1a/0x1520
> > drivers/net/virtio_net.c:2777

Here init_vqs() does:

get_online_cpus();
virtnet_set_affinity(vi);
put_online_cpus();

This disables CPU hotplug and calls virtnet_set_affinity().

Note, get_online_cpus() is no longer recursive.
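
To make the nesting concrete, here is a small userspace C model of the
acquisition order in the report. It is only a sketch: struct held_locks,
model_acquire(), model_release() and model_reported_path() are made-up
names for illustration, not kernel interfaces, and real lockdep tracks
far more state than a list of class names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HELD 8

/* Hypothetical per-task record of held lock classes (not a kernel type). */
struct held_locks {
    const char *name[MAX_HELD];
    size_t depth;
};

/*
 * Record an acquisition. Returns 0 on success, -1 if the same class is
 * already held by this task (the "possible recursive locking" case), or
 * -2 if the record is full.
 */
static int model_acquire(struct held_locks *h, const char *class_name)
{
    assert(h != NULL && class_name != NULL);
    for (size_t i = 0; i < h->depth; i++)
        if (strcmp(h->name[i], class_name) == 0)
            return -1;
    if (h->depth == MAX_HELD)
        return -2;
    h->name[h->depth++] = class_name;
    return 0;
}

/* Drop the most recently recorded entry for this class, if there is one. */
static void model_release(struct held_locks *h, const char *class_name)
{
    assert(h != NULL && class_name != NULL);
    for (size_t i = h->depth; i-- > 0; ) {
        if (strcmp(h->name[i], class_name) == 0) {
            for (size_t j = i; j + 1 < h->depth; j++)
                h->name[j] = h->name[j + 1];
            h->depth--;
            return;
        }
    }
}

/* Replays the order seen in the report for the boot task. */
static int model_reported_path(struct held_locks *h)
{
    int ret;

    ret = model_acquire(h, "cpu_hotplug_lock");   /* init_vqs(): get_online_cpus() */
    if (ret)
        return ret;
    ret = model_acquire(h, "xps_map_mutex");      /* __netif_set_xps_queue() */
    if (ret)
        return ret;
    return model_acquire(h, "cpu_hotplug_lock");  /* static_key_slow_inc() */
}
```

In this model the third acquisition is rejected because the class is
already on the task's list, which mirrors the two cpu_hotplug_lock
entries in the lockdep output.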

> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >
> >   lock(cpu_hotplug_lock.rw_sem);
> >   lock(cpu_hotplug_lock.rw_sem);
> >
> >  *** DEADLOCK ***
> >
> >  May be due to missing lock nesting notation
> >
> > 3 locks held by swapper/0/1:
> >  #0: (ptrval) (&dev->mutex){}, at: device_lock
> > include/linux/device.h:1134 [inline]
> >  #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x15f/0x2f0
> > drivers/base/dd.c:820
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777
> >  #2: (ptrval) (xps_map_mutex){+.+.}, at:
> > __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278
> >
> > stack backtrace:
> > CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> >  print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
> >  check_deadlock kernel/locking/lockdep.c:1809 [inline]
> >  validate_chain kernel/locking/lockdep.c:2405 [inline]
> >  __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435
> >  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
> >  percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline]
> >  percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> >  cpus_read_lock+0x43/0xa0 kernel/cpu.c:289
> >  static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >  __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320


__netif_set_xps_queue() calls static_key_slow_inc(), which also does a
get_online_cpus(), and that triggers this bug.

There's a static_key_slow_inc_cpuslocked() version that should be used
when get_online_cpus() is already taken, but I see that
__netif_set_xps_queue() is called from several places, and I doubt it
is always called with get_online_cpus() held. Thus just using the
cpuslocked() version is probably not a sufficient fix.
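
The difference between the two variants can be sketched in userspace C.
This is an illustration under simplifying assumptions, not the kernel
implementation: struct toy_hotplug_lock and every toy_* function are
hypothetical, and the lock is modelled for a single task as a plain flag
rather than a percpu rwsem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the hotplug lock, as seen by one task. */
struct toy_hotplug_lock {
    bool held;
    int failed_nested;  /* nested attempts that would have been flagged */
};

/* Non-recursive acquire: refuses if this task already holds the lock. */
static bool toy_cpus_read_lock(struct toy_hotplug_lock *l)
{
    assert(l != NULL);
    if (l->held) {
        l->failed_nested++;
        return false;
    }
    l->held = true;
    return true;
}

static void toy_cpus_read_unlock(struct toy_hotplug_lock *l)
{
    assert(l != NULL && l->held);
    l->held = false;
}

/* For callers that already hold the lock: only checks the precondition. */
static void toy_key_inc_cpuslocked(struct toy_hotplug_lock *l, int *key)
{
    assert(l != NULL && key != NULL);
    assert(l->held);
    (*key)++;
}

/* For callers that do not hold the lock: takes it around the update. */
static bool toy_key_inc(struct toy_hotplug_lock *l, int *key)
{
    if (!toy_cpus_read_lock(l))
        return false;
    toy_key_inc_cpuslocked(l, key);
    toy_cpus_read_unlock(l);
    return true;
}
```

A caller that already holds the lock can only use the _cpuslocked form,
and a caller that does not hold it can only use the plain form, so a
function reached from both kinds of callers cannot simply pick one.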

I don't know the code enough to offer other suggestions.

-- Steve


> >  netif_set_xps_queue+0x26/0x30 net/core/dev.c:2455
> >  virtnet_set_affinity+0x2ba/0x4b0 drivers/net/virtio_net.c:1944
> >  init_vqs+0xe22/0x1520 drivers/net/virtio_net.c:2778
> >  virtnet_probe+0x1092/0x2260 drivers/net/virtio_net.c:3016
> >  virtio_dev_probe+0x592/0x942 drivers/virtio/virtio.c:245
> >  really_probe drivers/base/dd.c:446 [inline]
> >  driver_probe_device+0x6ad/0x970 drivers/base/dd.c:588
> >  __driver_attach+0x28b/0x2f0 drivers/base/dd.c:822
> >  bus_for_each_dev+0x15d/0x1f0 drivers/base/bus.c:311
> >  driver_attach+0x3d/0x50 drivers/base/dd.c:841
> >  bus_add_driver+0x4b2/0x600 drivers/base/bus.c:667
> >  driver_register+0x1c8/0x320 drivers/base/driver.c:170
> >  register_virtio_driver+0x79/0xd0 

Re: net-next boot error

2018-07-26 Thread Dmitry Vyukov via Virtualization
On Thu, Jul 26, 2018 at 11:29 AM, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8
> dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com
>
> possible deadlock in static_key_slow_inc
> sd 0:0:1:0: [sda] Attached SCSI disk
> MACsec IEEE 802.1AE
> tun: Universal TUN/TAP device driver, 1.6
>
> 
> WARNING: possible recursive locking detected

+Tetsuo, perhaps this boot lockdep problem then disables lockdep for
actual testing. I think lockdep should respect panic_on_warn.


> 4.18.0-rc6+ #141 Not tainted
> 
> swapper/0/1 is trying to acquire lock:
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
>
> but task is already holding lock:
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> include/linux/cpu.h:126 [inline]
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0xe1a/0x1520
> drivers/net/virtio_net.c:2777
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>
>   lock(cpu_hotplug_lock.rw_sem);
>   lock(cpu_hotplug_lock.rw_sem);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 3 locks held by swapper/0/1:
>  #0: (ptrval) (&dev->mutex){}, at: device_lock
> include/linux/device.h:1134 [inline]
>  #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x15f/0x2f0
> drivers/base/dd.c:820
>  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> include/linux/cpu.h:126 [inline]
>  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777
>  #2: (ptrval) (xps_map_mutex){+.+.}, at:
> __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278
>
> stack backtrace:
> CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
>  print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
>  check_deadlock kernel/locking/lockdep.c:1809 [inline]
>  validate_chain kernel/locking/lockdep.c:2405 [inline]
>  __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435
>  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
>  percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline]
>  percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
>  cpus_read_lock+0x43/0xa0 kernel/cpu.c:289
>  static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
>  __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320
>  netif_set_xps_queue+0x26/0x30 net/core/dev.c:2455
>  virtnet_set_affinity+0x2ba/0x4b0 drivers/net/virtio_net.c:1944
>  init_vqs+0xe22/0x1520 drivers/net/virtio_net.c:2778
>  virtnet_probe+0x1092/0x2260 drivers/net/virtio_net.c:3016
>  virtio_dev_probe+0x592/0x942 drivers/virtio/virtio.c:245
>  really_probe drivers/base/dd.c:446 [inline]
>  driver_probe_device+0x6ad/0x970 drivers/base/dd.c:588
>  __driver_attach+0x28b/0x2f0 drivers/base/dd.c:822
>  bus_for_each_dev+0x15d/0x1f0 drivers/base/bus.c:311
>  driver_attach+0x3d/0x50 drivers/base/dd.c:841
>  bus_add_driver+0x4b2/0x600 drivers/base/bus.c:667
>  driver_register+0x1c8/0x320 drivers/base/driver.c:170
>  register_virtio_driver+0x79/0xd0 drivers/virtio/virtio.c:296
>  virtio_net_driver_init+0x8d/0xc9 drivers/net/virtio_net.c:3209
>  do_one_initcall+0x127/0x913 init/main.c:884
>  do_initcall_level init/main.c:952 [inline]
>  do_initcalls init/main.c:960 [inline]
>  do_basic_setup init/main.c:978 [inline]
>  kernel_init_freeable+0x49b/0x58e init/main.c:1135
>  kernel_init+0x11/0x1b3 init/main.c:1061
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> vcan: Virtual CAN interface driver
> vxcan: Virtual CAN Tunnel driver
> slcan: serial line CAN interface driver
> slcan: 10 dynamic interface channels.
> CAN device driver interface
> enic: Cisco VIC Ethernet NIC Driver, ver 2.3.0.53
> e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
> e100: Copyright(c) 1999-2006 Intel Corporation
> e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
> e1000: Copyright (c) 1999-2006 Intel Corporation.
> e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
> e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
> sky2: driver version 1.30
> PPP generic