On Tuesday, 12 August 2014, 15:44:59, Liu Bo wrote:
> This has been reported and discussed for a long time, and this hang occurs
> in both 3.15 and 3.16.

Liu, is this safe for testing yet?

Thanks,
Martin

> Btrfs now migrates to use kernel workqueue, but it introduces this hang
> problem.
> 
> Btrfs has a kind of work that is queued in an ordered way, which means that
> its ordered_func() callbacks must be processed in FIFO order, so it usually
> looks like --
> 
> normal_work_helper(arg)
>     work = container_of(arg, struct btrfs_work, normal_work);
> 
>     work->func() <---- (we name it work X)
>     for ordered_work in wq->ordered_list
>             ordered_work->ordered_func()
>             ordered_work->ordered_free()
> 
> The hang is a rare case: while searching for free space we may get an
> uncached block group, and then go read its free space cache inode for the
> free space information, which will
> 
> file a readahead request
>     btrfs_readpages()
>          for page that is not in page cache
>                 __do_readpage()
>                      submit_extent_page()
>                            btrfs_submit_bio_hook()
>                                  btrfs_bio_wq_end_io()
>                                  submit_bio()
>                                  end_workqueue_bio() <-- (called as the 1st
>                                       endio; queues a work, named work Y,
>                                       for the 2nd, the real, endio())
> 
> So the hang occurs when work Y's work_struct and work X's work_struct
> happen to share the same address.
> 
> A bit more explanation,
> 
> A,B,C -- struct btrfs_work
> arg   -- struct work_struct
> 
> kthread:
> worker_thread()
>     pick up a work_struct from @worklist
>     process_one_work(arg)
>       worker->current_work = arg;  <-- arg is A->normal_work
>       worker->current_func(arg)
>               normal_work_helper(arg)
>                    A = container_of(arg, struct btrfs_work, normal_work);
> 
>                    A->func()
>                    A->ordered_func()
>                    A->ordered_free()  <-- A gets freed
> 
>                    B->ordered_func()
>                         submit_compressed_extents()
>                             find_free_extent()
>                                 load_free_space_inode()
>                                     ...   <-- (the above readahead stack)
>                                     end_workqueue_bio()
>                                          btrfs_queue_work(work C)
>                    B->ordered_free()
> 
> If work A sits early in wq->ordered_list and more ordered works, such as
> work B, are queued after it, work A's memory can be freed before
> normal_work_helper() returns, which means that the kernel workqueue code in
> worker_thread() still has worker->current_work pointing at
> work A->normal_work, i.e. arg's address.
> 
> Meanwhile, if work C is allocated after work A is freed, work C->normal_work
> and work A->normal_work are likely to share the same address (I confirmed
> this with ftrace output, so I'm not just guessing; it is rare, though).
> 
> When another kthread picks up work C->normal_work to process, and finds that
> our kthread appears to be processing it (see find_worker_executing_work()),
> it will treat work C as a collision and skip it, which ends up with nobody
> processing work C.
> 
> So the situation is that our kthread is waiting forever on work C.
> 
> The key point is that they shouldn't have the same address, so this patch
> defers ->ordered_free() and does a batched free to avoid that.
> 
> Signed-off-by: Liu Bo <bo.li....@oracle.com>
> ---
>  fs/btrfs/async-thread.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 5a201d8..2ac01b3 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -195,6 +195,7 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
>       struct btrfs_work *work;
>       spinlock_t *lock = &wq->list_lock;
>       unsigned long flags;
> +     LIST_HEAD(free_list);
> 
>       while (1) {
>               spin_lock_irqsave(lock, flags);
> @@ -219,17 +220,24 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
> 
>               /* now take the lock again and drop our item from the list */
>               spin_lock_irqsave(lock, flags);
> -             list_del(&work->ordered_list);
> +             list_move_tail(&work->ordered_list, &free_list);
>               spin_unlock_irqrestore(lock, flags);
> 
>               /*
>                * we don't want to call the ordered free functions
>                * with the lock held though
>                */
> +     }
> +     spin_unlock_irqrestore(lock, flags);
> +
> +     while (!list_empty(&free_list)) {
> +             work = list_entry(free_list.next, struct btrfs_work,
> +                               ordered_list);
> +
> +             list_del(&work->ordered_list);
>               work->ordered_free(work);
>               trace_btrfs_all_work_done(work);
>       }
> -     spin_unlock_irqrestore(lock, flags);
>  }
> 
>  static void normal_work_helper(struct work_struct *arg)

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7