On Mon, Mar 27, 2017 at 4:16 PM, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote: > On Mon, Mar 27, 2017 at 02:36:35PM +0200, Dmitry Vyukov wrote: >> On Tue, Mar 14, 2017 at 5:21 PM, Paul E. McKenney >> <paul...@linux.vnet.ibm.com> wrote: >> > On Tue, Mar 14, 2017 at 12:47:02AM -0700, Lance Roy wrote: >> >> I am not sure how the rcu_scheduler_active changes in __synchronize_srcu >> >> work, >> >> but there seem to be a few problems in them. First, >> >> "if (done && likely(!driving))" on line 453 doesn't appear to ever happen, >> >> as driving doesn't get set to false when srcu_reschedule is called. This >> >> seems >> >> like it could cause a race condition if another thread notices that >> >> ->running is >> >> false, adds itself to the queue, set ->running to true, and starts on its >> >> own >> >> grace period before the first thread acquires the lock again on line 469. >> >> Then >> >> the first thread will then acquire the lock, set running to false, and >> >> release >> >> the lock, resulting in an invalid state where ->running is false but the >> >> second >> >> thread is still trying to finish its grace period. >> >> >> >> Second, the while loop on line 455 seems to violate to rule that ->running >> >> shouldn't be false when there are entries in the queue. If a second >> >> thread adds >> >> itself to the queue while the first thread is driving SRCU inside that >> >> loop, and >> >> then the first thread finishes its own grace period and quits the loop, >> >> it will >> >> set ->running to false even though there is still a thread on the queue. >> >> >> >> The second issue requires rcu_scheduler_active to be RCU_SCHEDULER_INIT to >> >> occur, and as I don't what the assumptions during RCU_SCHEDULER_INIT are >> >> I don't >> >> know if it is actually a problem, but the first issue looks like it could >> >> occur >> >> at any time. >> > >> > Thank you for looking into this! >> > >> > I determined that my patch-order strategy was flawed, as it required >> > me to rewrite the mid-boot functionality several times. I therefore >> > removed the mid-boot commits. I will add them in later, but they will >> > use a rather different approach based on a grace-period sequence number >> > similar to that used by the expedited grace periods. >> > >> > Which should also teach me to be less aggressive about pushing new code >> > to -next. For a few weeks, anyway. ;-) >> > >> > Thanx, Paul >> > >> >> Thanks, >> >> Lance >> >> >> >> On Fri, 10 Mar 2017 14:26:09 -0800 >> >> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote: >> >> > On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote: >> >> > > On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov >> >> > > <andreyk...@google.com> >> >> > > wrote: >> >> > > > Kernel panic - not syncing: Fatal exception >> >> > >> >> > So the theory is that if !sp->running, all of SRCU's queues must be >> >> > empty. >> >> > So if you are holding ->queue_lock (with irqs disabled) and you see >> >> > !sp->running, and then you enqueue a callback on ->batch_check0, then >> >> > that callback must be the first in the list. And the code preceding >> >> > the WARN_ON() you triggered does in fact check and enqueue shile holding >> >> > ->queue_lock with irqs disabled. >> >> > >> >> > And rcu_batch_queue() does operate FIFO as required. (Otherwise, >> >> > srcu_barrier() would not work.) >> >> > >> >> > There are only three calls to rcu_batch_queue(), and the one involved >> >> > with >> >> > the WARN_ON() enqueues to ->batch_check0. The other two enqueue to >> >> > ->batch_queue. Callbacks move from ->batch_queue to ->batch_check0 to >> >> > ->batch_check1 to ->batch_done, so nothing should slip in front. >> >> > >> >> > Of course, if ->running were ever set to false with any of >> >> > ->batch_check0, >> >> > ->batch_check1, or ->batch_done non-empty, this WARN_ON() would trigger. >> >> > But srcu_reschedule() sets it to false only if all four batches are >> >> > empty (and checks and sets under ->queue_lock()), and all other cases >> >> > where it is set to false happen at initialization time, and also clear >> >> > out the queues. Of course, if someone raced an init_srcu_struct() with >> >> > either a call_srcu() or synchronize_srcu(), all bets are off. Now, >> >> > mmu_notifier.c does invoke init_srcu_struct() manually, but it does >> >> > so at subsys_initcall() time. Which -might- be after other things are >> >> > happening, so one "hail Mary" attempted fix is to remove >> >> > mmu_notifier_init() >> >> > and replace the "static struct srcu_struct srcu" with: >> >> >> >> > >> >> > DEFINE_STATIC_SRCU(srcu); >> >> > >> >> > But this might require changing the name -- I vaguely recall some >> >> > strangeness where the names of statically defined per-CPU variables need >> >> > to be globally unique even when static. Easy enough to do, though. >> >> > Might need a similar change to the "srcu" instances defined in vmd.c >> >> > and kvm_host.h -- assuming that this change helps. >> >> > >> >> > Another possibility is that something in SRCU is messing with either the >> >> > queues or the ->running field without holding ->queue_lock. And that >> >> > does >> >> > seem to be happening -- srcu_advance_batches() invokes rcu_batch_move() >> >> > without holding anything. Which seems like it could cause trouble >> >> > if someone else was invoking synchronize_srcu() concurrently. Those >> >> > particular invocations might be safe due to access only by a single >> >> > kthread/workqueue, given that all updates to ->batch_queue are protected >> >> > by ->queue_lock (aside from initialization). >> >> > >> >> > But ->batch_check0 is updated by __synchronize_srcu(), though protected >> >> > by ->queue_lock, and only if ->running is false, and with both the >> >> > check and the update protected by the same ->queue_lock critical >> >> > section. >> >> > If ->running is false, then the workqueue cannot be running, so it >> >> > remains >> >> > to see if all other updates to ->batch_check0 are either with >> >> > ->queue_lock >> >> > held and ->running false on the one hand or from the workqueue handler >> >> > on the other: >> >> > >> >> > srcu_collect_new() updates with ->queue_lock held, but does not check >> >> > ->running. It is invoked only from process_srcu(), which in >> >> > turn is invoked only as a workqueue handler. The work is queued >> >> > from: >> >> > >> >> > call_srcu(), which does so with ->queue_lock held having just >> >> > set ->running to true. >> >> > >> >> > srcu_reschedule(), which invokes it if there are non-empty >> >> > queues. This is invoked from __synchronize_srcu() >> >> > in the case where it has set ->running to true >> >> > after finding the queues empty, which should imply >> >> > no other instances. >> >> > >> >> > It is also invoked from process_srcu(), which is >> >> > invoked only as a workqueue handler. (Yay >> >> > recursive inquiry!) >> >> > >> >> > srcu_advance_batches() updates without locks held. It is invoked as >> >> > follows: >> >> > >> >> > __synchronize_srcu() in the case where ->running was set, which >> >> > as noted before excludes workqueue handlers. >> >> > >> >> > process_srcu() which as noted before is only invoked from >> >> > a workqueue handler. >> >> > >> >> > So an SRCU workqueue is invoked only from a workqueue handler, or from >> >> > some other task that transitioned ->running from false to true while >> >> > holding ->queuelock. There should therefore only be one SRCU workqueue >> >> > per srcu_struct, so this should be safe. Though I hope that it can >> >> > be simplified a bit. :-/ >> >> > >> >> > So the only suggestion I have at the moment is static definition of >> >> > the "srcu" variable. Lai, Josh, Steve, Mathieu, anything I missed? >> >> > >> >> > Thanx, Paul >> >> >> >> This happened on linux-next/65b2dc38291f9f27e5ec3b804d6eb3b5f79a3ce4 >> and may be related. >> The report says that srcu subsystem still uses the srcu object after >> it has been freed. It can be a kvm fault as well. > > Hmmm... I am not seeing a call to cleanup_srcu_struct() for the > ->track_srcu field of the kvm_page_track_notifier_head structure. > Or is this structure immortal, so that it is never cleaned up? > Or am I just blind this morning? > > In any case, freeing the kvm_page_track_notifier_head structure > without first invoking cleanup_srcu_struct() on its ->track_srcu > srcu_struct field could easily result in a use-after-free bug.
Sent this to kvm people: https://groups.google.com/d/msg/syzkaller/Sl0POwca6-s/QR_z6AsFCQAJ >> ================================================================== >> BUG: KASAN: use-after-free in debug_spin_unlock >> kernel/locking/spinlock_debug.c:97 [inline] >> BUG: KASAN: use-after-free in do_raw_spin_unlock+0x2ea/0x320 >> kernel/locking/spinlock_debug.c:134 >> Read of size 4 at addr ffff88014158a564 by task kworker/1:1/5712 >> >> CPU: 1 PID: 5712 Comm: kworker/1:1 Not tainted 4.11.0-rc3-next-20170324+ #1 >> Hardware name: Google Google Compute Engine/Google Compute Engine, >> BIOS Google 01/01/2011 >> Workqueue: events_power_efficient process_srcu >> Call Trace: >> __dump_stack lib/dump_stack.c:16 [inline] >> dump_stack+0x2fb/0x40f lib/dump_stack.c:52 >> print_address_description+0x7f/0x260 mm/kasan/report.c:250 >> kasan_report_error mm/kasan/report.c:349 [inline] >> kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:372 >> kasan_report mm/kasan/report.c:392 [inline] >> __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:392 >> debug_spin_unlock kernel/locking/spinlock_debug.c:97 [inline] >> do_raw_spin_unlock+0x2ea/0x320 kernel/locking/spinlock_debug.c:134 >> __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:167 [inline] >> _raw_spin_unlock_irq+0x22/0x70 kernel/locking/spinlock.c:199 >> spin_unlock_irq include/linux/spinlock.h:349 [inline] >> srcu_reschedule+0x1a1/0x260 kernel/rcu/srcu.c:582 >> process_srcu+0x63c/0x11c0 kernel/rcu/srcu.c:600 >> process_one_work+0xac0/0x1b00 kernel/workqueue.c:2097 >> worker_thread+0x1b4/0x1300 kernel/workqueue.c:2231 >> kthread+0x36c/0x440 kernel/kthread.c:231 >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 >> >> Allocated by task 20961: >> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 >> save_stack+0x43/0xd0 mm/kasan/kasan.c:515 >> set_track mm/kasan/kasan.c:527 [inline] >> kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:619 >> kmem_cache_alloc_trace+0x10b/0x670 mm/slab.c:3635 >> kmalloc include/linux/slab.h:492 [inline] >> kzalloc include/linux/slab.h:665 [inline] >> kvm_arch_alloc_vm include/linux/kvm_host.h:773 [inline] >> kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:610 [inline] >> kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 >> [inline] >> kvm_dev_ioctl+0x1bf/0x1460 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3205 >> vfs_ioctl fs/ioctl.c:45 [inline] >> do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685 >> SYSC_ioctl fs/ioctl.c:700 [inline] >> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 >> entry_SYSCALL_64_fastpath+0x1f/0xbe >> >> Freed by task 20960: >> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 >> save_stack+0x43/0xd0 mm/kasan/kasan.c:515 >> set_track mm/kasan/kasan.c:527 [inline] >> kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:592 >> __cache_free mm/slab.c:3511 [inline] >> kfree+0xd3/0x250 mm/slab.c:3828 >> kvm_arch_free_vm include/linux/kvm_host.h:778 [inline] >> kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:732 [inline] >> kvm_put_kvm+0x709/0x9a0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:747 >> kvm_vm_release+0x42/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:758 >> __fput+0x332/0x800 fs/file_table.c:209 >> ____fput+0x15/0x20 fs/file_table.c:245 >> task_work_run+0x197/0x260 kernel/task_work.c:116 >> exit_task_work include/linux/task_work.h:21 [inline] >> do_exit+0x1a53/0x27c0 kernel/exit.c:878 >> do_group_exit+0x149/0x420 kernel/exit.c:982 >> get_signal+0x7d8/0x1820 kernel/signal.c:2318 >> do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808 >> exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157 >> prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline] >> syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:263 >> entry_SYSCALL_64_fastpath+0xbc/0xbe >> >> The buggy address belongs to the object at ffff880141581640 >> which belongs to the cache kmalloc-65536 of size 65536 >> The buggy address is located 36644 bytes inside of >> 65536-byte region [ffff880141581640, ffff880141591640) >> The buggy address belongs to the page: >> page:ffffea000464b400 count:1 mapcount:0 mapping:ffff880141581640 >> index:0x0 compound_mapcount: 0 >> flags: 0x200000000008100(slab|head) >> raw: 0200000000008100 ffff880141581640 0000000000000000 0000000100000001 >> raw: ffffea00064b1f20 ffffea000640fa20 ffff8801db800d00 >> page dumped because: kasan: bad access detected >> >> Memory state around the buggy address: >> ffff88014158a400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88014158a480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> >ffff88014158a500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ^ >> ffff88014158a580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88014158a600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ================================================================== >> >