On Mon, 23 Dec 2013 10:35:04 +0800
, Qu Wenruo wrote:
On fri, 20 Dec 2013 05:30:48 -0800, Josef Bacik wrote:

On 12/19/2013 07:08 PM, Qu Wenruo wrote:
I'm sorry but I failed to reproduce the problem.
Btrfs/012 in xfstests has been run for serveral hours but nothing happened.

Would you please give me some more details about the environment or the panic backtrace?


Ok so it wasn't that test, it was just ./check -g auto. It would sometimes die at btrfs/012, other times it would make it as far as generic/083 before it keeled over. I bisected it down to

btrfs: Replace fs_info->workers with btrfs_workqueue

which of course is just the first patch that you start using the new code which isn't helpful. My dmesg from one of my runs is here

http://ur1.ca/g867b

All of the initial panics looked exactly the same. I'm just going to kick the series out for now while you track down the problem. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Thanks for your dmesg a lot.

It's quite sad that I still failed to reproduce the same bug on all my test environment during the weekend.

But according to the dmesg, which is expired now though, I made the following assumption:

There is a possibility that out of order execution made the "work->wq = wq" sentence executed behind the "queue_work()" call, and the "normal_work_helper" is executed in another CPU instantly, which caused the problem.

------
static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq,
struct btrfs_work *work)
{
unsigned long flags;

work->wq = wq;
/* There is no memory barrier ensure the work->wq = wq is executed
before queue_work */
thresh_queue_hook(wq);
if (work->ordered_func) {
spin_lock_irqsave(&wq->list_lock, flags);
list_add_tail(&work->ordered_list, &wq->ordered_list);
spin_unlock_irqrestore(&wq->list_lock, flags);
}
queue_work(wq->normal_wq, &work->normal_work);
}
------

The memory order problem involves compiler or even CPU hardware (AMD CPUS seems to have a weaker memory order than Intel), so to simulate the bug, I used the following codes tried to reproduce the bug, and the result is pretty much the same as your dmesg.
------
static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq,
struct btrfs_work *work)
{
unsigned long flags;
int in_order = get_random_int() % 1000

if (in_order)
work->wq = wq;
thresh_queue_hook(wq);
if (work->ordered_func) {
spin_lock_irqsave(&wq->list_lock, flags);
list_add_tail(&work->ordered_list, &wq->ordered_list);
spin_unlock_irqrestore(&wq->list_lock, flags);
}
queue_work(wq->normal_wq, &work->normal_work);

/* Try make the codes out of order since the Intel CPU
seems obey the memory order quite well */
if (!in_order)
work->wq = wq;
}
------

The fix is as easy as just adding a smp_wmb() behind the "work->wq = wq" sentence, but since I failed to reproduce the bug,
I can't confirm whether the fix is good or not.

I know it's impolite but would you please first try the smp_wmb() first to confirm the fix?

If the fix is good I'll send the formal V5 patchset.

Thanks

Qu
Happy new year!

Very sorry for disturbing but ithas been some time before my last replay
and no feedback since then (mainly because of the new year holiday).

If itis convenient for you,would you please give some feedback about the smp_wmb() fix ?

Also incase that smp_wmb() doesn't work, it would be very nice of you to
also try smp_mb() fix.

Thanks
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to