Some additional info about the deadlock:
crash> bt 16588
PID: 16588 TASK: ffff9ffd7f332b00 CPU: 1 COMMAND: "bcache_allocato"
[exception RIP: bch_crc64+57]
RIP: ffffffffc093b2c9 RSP: ffffab9585767e28 RFLAGS: 00000286
RAX: f1f51403756de2bd RBX: 0000000000000000 RCX: 0000000000000065
RDX: 0000000000000065 RSI: ffff9ffd63980000 RDI: ffff9ffd63925346
RBP: ffffab9585767e28 R8: ffffffffc093db60 R9: ffffab9585739000
R10: 000000000000007f R11: 000000001ffef001 R12: 0000000000000000
R13: 0000000000000008 R14: ffff9ffd63900000 R15: ffff9ffd683d0000
CS: 0010 SS: 0018
#0 [ffffab9585767e30] bch_prio_write at ffffffffc09325c0 [bcache]
#1 [ffffab9585767eb0] bch_allocator_thread at ffffffffc091bdc5 [bcache]
#2 [ffffab9585767f08] kthread at ffffffffa80b2481
#3 [ffffab9585767f50] ret_from_fork at ffffffffa8a00205
crash> bt 14658
PID: 14658 TASK: ffff9ffd7a9f0000 CPU: 0 COMMAND: "python3"
#0 [ffffab958380bb48] __schedule at ffffffffa89ae441
#1 [ffffab958380bbe8] schedule at ffffffffa89aea7c
#2 [ffffab958380bbf8] bch_bucket_alloc at ffffffffc091c370 [bcache]
#3 [ffffab958380bc68] __bch_bucket_alloc_set at ffffffffc091c5ce [bcache]
#4 [ffffab958380bcb8] bch_bucket_alloc_set at ffffffffc091c66e [bcache]
#5 [ffffab958380bcf8] __uuid_write at ffffffffc0931b69 [bcache]
#6 [ffffab958380bda0] bch_uuid_write at ffffffffc0931f76 [bcache]
#7 [ffffab958380bdc0] __cached_dev_store at ffffffffc0937c08 [bcache]
#8 [ffffab958380be20] bch_cached_dev_store at ffffffffc0938309 [bcache]
#9 [ffffab958380be50] sysfs_kf_write at ffffffffa830c97c
#10 [ffffab958380be60] kernfs_fop_write at ffffffffa830c3e5
#11 [ffffab958380bea0] __vfs_write at ffffffffa827e5bb
#12 [ffffab958380beb0] vfs_write at ffffffffa827e781
#13 [ffffab958380bee8] sys_write at ffffffffa827e9fc
#14 [ffffab958380bf30] do_syscall_64 at ffffffffa8003b03
#15 [ffffab958380bf50] entry_SYSCALL_64_after_hwframe at ffffffffa8a00081
RIP: 00007faffc7bd154 RSP: 00007ffe307cbc88 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007faffc7bd154
RDX: 0000000000000008 RSI: 00000000011ce7f0 RDI: 0000000000000003
RBP: 00007faffccb86c0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000000 R14: 00000000011ce7f0 R15: 0000000000f33e60
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
In this case the task "python3" (pid 14658) gets stuck in a wait inside
bch_bucket_alloc() that never completes. The task that should wake
"python3" from this wait is "bcache_allocator" (pid 16588), but the
wakeup never happens, because bcache_allocator is itself stuck in this
"retry_invalidate" busy loop:
static int bch_allocator_thread(void *arg)
{
        ...
retry_invalidate:
                allocator_wait(ca, ca->set->gc_mark_valid &&
                               !ca->invalidate_needs_gc);
                invalidate_buckets(ca);

                /*
                 * Now, we write their new gens to disk so we can start writing
                 * new stuff to them:
                 */
                allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
                if (CACHE_SYNC(&ca->set->sb)) {
                        /*
                         * This could deadlock if an allocation with a btree
                         * node locked ever blocked - having the btree node
                         * locked would block garbage collection, but here we're
                         * waiting on garbage collection before we invalidate
                         * and free anything.
                         *
                         * But this should be safe since the btree code always
                         * uses btree_check_reserve() before allocating now, and
                         * if it fails it blocks without btree nodes locked.
                         */
                        if (!fifo_full(&ca->free_inc))
                                goto retry_invalidate;

                        if (bch_prio_write(ca, false) < 0) {
                                ca->invalidate_needs_gc = 1;
                                wake_up_gc(ca->set);
                                goto retry_invalidate;
                        }
                }
        ...
}
The exact code path is this: bch_prio_write() fails because the
bch_bucket_alloc() it calls fails (out of free buckets); it then wakes
up the garbage collector (trying to free up some buckets) and jumps
back to retry_invalidate, but apparently that is not enough:
bch_prio_write() fails over and over again (no buckets ever become
available), and the thread is unable to break out of the busy loop =>
deadlock.
Looking more closely at the code, it seems safe to resume the
bcache_allocator main loop when bch_prio_write() fails (still keeping
the wake-up event to the garbage collector), instead of jumping back to
retry_invalidate. This should give the allocator a better chance to
free up some buckets, possibly preventing the "out of buckets" deadlock
condition.
I'm currently testing a kernel with this change; if I can't trigger any
deadlock in the next hour or so, I'll upload a new test kernel.
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292
Title:
Tight timeout for bcache removal causes spurious failures
Status in curtin:
Fix Released
Status in linux package in Ubuntu:
Confirmed
Status in linux source package in Bionic:
Confirmed
Status in linux source package in Cosmic:
Confirmed
Status in linux source package in Disco:
Confirmed
Status in linux source package in Eoan:
Confirmed
Bug description:
I've had a number of deployment faults where curtin would report
"Timeout exceeded for removal of /sys/fs/bcache/xxx" when doing a
mass-deployment of 30+ nodes. Upon retrying, the node would usually
deploy fine. Experimentally I've set the timeout ridiculously high, and
it seems I'm getting no faults with this. I'm wondering if the timeout
for removal is set too tight, or might need to be made configurable.
--- curtin/util.py~     2018-05-18 18:40:48.000000000 +0000
+++ curtin/util.py      2018-10-05 09:40:06.807390367 +0000
@@ -263,7 +263,7 @@
         return _subp(*args, **kwargs)


-def wait_for_removal(path, retries=[1, 3, 5, 7]):
+def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
     if not path:
         raise ValueError('wait_for_removal: missing path parameter')
To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions