Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 23:23 +0200, Alexandru Moise wrote:
> Hi, I tested it, it doesn't solve the problem.
> By the time you get here it's already too late, my patch
> prevents this from failing in the first place.

Hello Alex,

If you can share the steps to follow to trigger the bug you reported then
I will have a closer look at this.

Thanks,

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> > But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
> > an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
> > is still populated by the entry we failed to remove at __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?
> 
> Thanks,
> 
> Bart.
> 
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1c16694ae145..f2ced19e74b8 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (preloaded)
>  		radix_tree_preload_end();
> 
> -	if (IS_ERR(blkg))
> -		return PTR_ERR(blkg);
> +	if (IS_ERR(blkg)) {
> +		ret = PTR_ERR(blkg);
> +		goto destroy_all;
> +	}
> 
>  	q->root_blkg = blkg;
>  	q->root_rl.blkg = blkg;
> 
>  	ret = blk_throtl_init(q);
>  	if (ret) {
> +destroy_all:
>  		spin_lock_irq(q->queue_lock);
>  		blkg_destroy_all(q);
>  		spin_unlock_irq(q->queue_lock);
> 

Hi, I tested it, it doesn't solve the problem.
By the time you get here it's already too late, my patch
prevents this from failing in the first place.

I would have liked this more than my solution though.

../Alex



Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 13:02 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > > But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
> > > > > an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
> > > > > is still populated by the entry we failed to remove at __blk_release_queue().
> > > > 
> > > > Hello Alex,
> > > > 
> > > > Had you considered something like the untested patch below?
> > > 
> > > But queue init shouldn't fail here, right?
> > 
> > Hello Tejun,
> > 
> > Your question is not entirely clear to me. Are you referring to the atomic
> > allocations in blkg_create() or are you perhaps referring to something else?
> 
> Hmm.. maybe I'm confused but I thought that the fact that
> blkcg_init_queue() fails itself is already a bug, which happens
> because a previously destroyed queue left behind blkgs.

Hello Tejun,

I had missed the start of this thread so I was not aware of which problem Alex
was trying to solve. In the description of v1 of this patch I read that Alex
thinks that he ran into a scenario in which blk_alloc_queue_node() assigns a
q->id that is still in use by another request queue? That's weird. The following
code still occurs in __blk_release_queue():

	ida_simple_remove(&blk_queue_ida, q->id);

It's not clear to me how that remove call could happen *before* q->id is removed
from the blkcg radix tree.

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
> > > > an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
> > > > is still populated by the entry we failed to remove at __blk_release_queue().
> > > 
> > > Hello Alex,
> > > 
> > > Had you considered something like the untested patch below?
> > 
> > But queue init shouldn't fail here, right?
> 
> Hello Tejun,
> 
> Your question is not entirely clear to me. Are you referring to the atomic
> allocations in blkg_create() or are you perhaps referring to something else?

Hmm.. maybe I'm confused but I thought that the fact that
blkcg_init_queue() fails itself is already a bug, which happens
because a previously destroyed queue left behind blkgs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > On 04/11/18 13:00, Alexandru Moise wrote:
> > > But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
> > > an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
> > > is still populated by the entry we failed to remove at __blk_release_queue().
> > 
> > Hello Alex,
> > 
> > Had you considered something like the untested patch below?
> 
> But queue init shouldn't fail here, right?

Hello Tejun,

Your question is not entirely clear to me. Are you referring to the atomic
allocations in blkg_create() or are you perhaps referring to something else?

Bart.




Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> >But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
> >an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
> >is still populated by the entry we failed to remove at __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?

But queue init shouldn't fail here, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche

On 04/11/18 13:00, Alexandru Moise wrote:

But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
is still populated by the entry we failed to remove at __blk_release_queue().


Hello Alex,

Had you considered something like the untested patch below?

Thanks,

Bart.


diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..f2ced19e74b8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q)
 	if (preloaded)
 		radix_tree_preload_end();
 
-	if (IS_ERR(blkg))
-		return PTR_ERR(blkg);
+	if (IS_ERR(blkg)) {
+		ret = PTR_ERR(blkg);
+		goto destroy_all;
+	}
 
 	q->root_blkg = blkg;
 	q->root_rl.blkg = blkg;
 
 	ret = blk_throtl_init(q);
 	if (ret) {
+destroy_all:
 		spin_lock_irq(q->queue_lock);
 		blkg_destroy_all(q);
 		spin_unlock_irq(q->queue_lock);



Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 03:54:53PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> > [0.76] BUG: unable to handle kernel NULL pointer dereference at 01b4
> > [0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
> > [0.763350]
> > [0.76] PGD 0 P4D 0
> > [0.76] Oops:  [#2] PREEMPT SMP
> > [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  4.16.0-ARCH+ #81
> > [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
> > [0.76] Workqueue: nvme-reset-wq nvme_reset_work
> > [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
> > [0.76] RSP: :c91bfcb0 EFLAGS: 00010246
> > [0.76] RAX: 88003b698000 RBX:  RCX: 
> > [0.76] RDX: 88003b698000 RSI: fff4 RDI: 001c
> > [0.76] RBP: c91bfcc0 R08:  R09: 
> > [0.76] R10: eaeaa980 R11: 814e0970 R12: 001c
> > [0.76] R13:  R14:  R15: 88003aad8010
> > [0.76] FS:  () GS:88003e40() knlGS:
> > [0.76] CS:  0010 DS:  ES:  CR0: 80050033
> > [0.76] CR2: 01b4 CR3: 02209001 CR4: 000606f0
> > [0.76] Call Trace:
> > [0.76]  blk_mq_quiesce_queue+0x23/0x80
> > [0.76]  nvme_dev_disable+0x34f/0x480
> > [0.76]  ? nvme_irq+0x50/0x50
> > [0.76]  ? dev_warn+0x64/0x80
> > [0.76]  nvme_reset_work+0x13de/0x1570
> > [0.76]  ? __switch_to_asm+0x34/0x70
> > [0.76]  ? __switch_to_asm+0x40/0x70
> > [0.76]  ? _raw_spin_unlock_irq+0x15/0x30
> > [0.76]  ? finish_task_switch+0x156/0x210
> > [0.76]  process_one_work+0x20c/0x3d0
> > [0.76]  worker_thread+0x216/0x400
> > [0.76]  kthread+0x125/0x130
> > [0.76]  ? process_one_work+0x3d0/0x3d0
> > [0.76]  ? __kthread_bind_mask+0x60/0x60
> > [0.76]  ret_from_fork+0x3a/0x50
> 
> Hello Alexandru,
> 
> What made you look at cgroups? In the above register dump I see that %rbx ==
> NULL. I think that means that the queue pointer argument of
> blk_queue_flag_set() is NULL. The NVMe initiator driver should never pass a
> NULL pointer to blk_mq_quiesce_queue(). Please ask the NVMe driver
> maintainers for their opinion on the linux-nvme mailing list.
> 
> Thanks,
> 
> Bart.

The %rbx == NULL is only a symptom of the cgroup mishandling. Perhaps we could
improve error handling in the NVMe driver, but I can say that about a lot of
block drivers actually; perhaps I will write some patches in the future to
improve the error handling.

But the root cause of it is in blkcg_init_queue(), when blkg_create() returns
an ERR ptr because it tries to insert into an index of blkcg->blkg_tree that
is still populated by the entry we failed to remove at __blk_release_queue().

Thanks,
../Alex


> 
> 
> 


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +, Bart Van Assche wrote:
> Please explain what you wrote further. It's not clear to me why you think
> that anything is broken nor what a "sever model" is.

So, the sever (or revoke) model is where you actively disconnect an object
and ensure that there will be no further references from others.  In
contrast, what we usually do is refcounting, where we don't actively
sever the dying object from its users but let the users drain
themselves over time and destroy the object when we know all the users
are gone (refcnt reaching zero).

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller. Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, new requests aren't the only source tho.  e.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you gotta hunt down all those and make
sure they can't make any further derefs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:15 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> > A simple and effective solution is to dissociate a request queue from the
> > block cgroup controller before blk_cleanup_queue() returns. This is why 
> > commit
> > a063057d7c73 ("block: Fix a race between request queue removal and the block
> > cgroup controller") moved the blkcg_exit_queue() call from 
> > __blk_release_queue()
> > into blk_cleanup_queue().
> 
> which is broken.  We can try to switch the lifetime model to revoking
> all live objects but that likely is a lot more involved than blindly
> moving blkg shootdown from release to cleanup.  Implementing sever
> semantics is usually a lot more involved / fragile because it requires
> explicit participation from all users (exactly the same way revoking
> ->queue_lock is difficult).
> 
> I'm not necessarily against switching to sever model, but what the
> patch did isn't that.  It just moved some code without actually
> understanding or auditing what the implications are.

Hello Tejun,

Please explain what you wrote further. It's not clear to me why you think
that anything is broken nor what a "sever model" is.

I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
to avoid race conditions between request queue cleanup and the block cgroup
controller. Additionally, since it is guaranteed that no new requests will
be submitted to a queue after it has been marked dead I don't see why it
would make sense to keep the association between a request queue and the
blkcg controller until the last reference on a queue is dropped.

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns. This is why commit
> a063057d7c73 ("block: Fix a race between request queue removal and the block
> cgroup controller") moved the blkcg_exit_queue() call from 
> __blk_release_queue()
> into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:00 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > > And looking at the change, it looks like the right thing we should
> > > have done is caching @lock on the print_blkg side and when switching
> > > locks make sure both locks are held.  IOW, do the following in
> > > blk_cleanup_queue()
> > > 
> > > 	spin_lock_irq(lock);
> > > 	if (q->queue_lock != &q->__queue_lock) {
> > > 		spin_lock(&q->__queue_lock);
> > > 		q->queue_lock = &q->__queue_lock;
> > > 		spin_unlock(&q->__queue_lock);
> > > 	}
> > > 	spin_unlock_irq(lock);
> > > 
> > > Otherwise, there can be two lock holders thinking they have exclusive
> > > access to the request_queue.
> > 
> > I think that's a bad idea. A block driver is allowed to destroy the
> > spinlock it associated with the request queue as soon as blk_cleanup_queue()
> > has finished. If the block cgroup controller would cache a pointer to the
> > block driver spinlock then that could cause the cgroup code to attempt to
> > lock a spinlock after it has been destroyed. I don't think we need that kind
> > of race conditions.
> 
> I see, but that problem is there with or without caching as long as we
> have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
> that user caches the lock for matching unlocking or not doesn't really
> change the situation.
> 
> Short of adding protection around queue_lock switching, I can't think
> of a solution tho.  Probably the right thing to do is adding queue
> lock/unlock helpers which are safe to use beyond cleanup_queue.

Hello Tejun,

A simple and effective solution is to dissociate a request queue from the
block cgroup controller before blk_cleanup_queue() returns. This is why commit
a063057d7c73 ("block: Fix a race between request queue removal and the block
cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue()
into blk_cleanup_queue().

Thanks,

Bart.






Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> > 
> > 	spin_lock_irq(lock);
> > 	if (q->queue_lock != &q->__queue_lock) {
> > 		spin_lock(&q->__queue_lock);
> > 		q->queue_lock = &q->__queue_lock;
> > 		spin_unlock(&q->__queue_lock);
> > 	}
> > 	spin_unlock_irq(lock);
> > 
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
> 
> I think that's a bad idea. A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished. If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed. I don't think we need that kind
> of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for matching unlocking or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution tho.  Probably the right thing to do is adding queue
lock/unlock helpers which are safe to use beyond cleanup_queue.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> And looking at the change, it looks like the right thing we should
> have done is caching @lock on the print_blkg side and when switching
> locks make sure both locks are held.  IOW, do the following in
> blk_cleanup_queue()
> 
> 	spin_lock_irq(lock);
> 	if (q->queue_lock != &q->__queue_lock) {
> 		spin_lock(&q->__queue_lock);
> 		q->queue_lock = &q->__queue_lock;
> 		spin_unlock(&q->__queue_lock);
> 	}
> 	spin_unlock_irq(lock);
> 
> Otherwise, there can be two lock holders thinking they have exclusive
> access to the request_queue.

I think that's a bad idea. A block driver is allowed to destroy the
spinlock it associated with the request queue as soon as blk_cleanup_queue()
has finished. If the block cgroup controller would cache a pointer to the
block driver spinlock then that could cause the cgroup code to attempt to
lock a spinlock after it has been destroyed. I don't think we need that kind
of race conditions.

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> [0.76] BUG: unable to handle kernel NULL pointer dereference at 01b4
> [0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
> [0.763350]
> [0.76] PGD 0 P4D 0
> [0.76] Oops:  [#2] PREEMPT SMP
> [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  4.16.0-ARCH+ #81
> [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
> [0.76] Workqueue: nvme-reset-wq nvme_reset_work
> [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
> [0.76] RSP: :c91bfcb0 EFLAGS: 00010246
> [0.76] RAX: 88003b698000 RBX:  RCX: 
> [0.76] RDX: 88003b698000 RSI: fff4 RDI: 001c
> [0.76] RBP: c91bfcc0 R08:  R09: 
> [0.76] R10: eaeaa980 R11: 814e0970 R12: 001c
> [0.76] R13:  R14:  R15: 88003aad8010
> [0.76] FS:  () GS:88003e40() knlGS:
> [0.76] CS:  0010 DS:  ES:  CR0: 80050033
> [0.76] CR2: 01b4 CR3: 02209001 CR4: 000606f0
> [0.76] Call Trace:
> [0.76]  blk_mq_quiesce_queue+0x23/0x80
> [0.76]  nvme_dev_disable+0x34f/0x480
> [0.76]  ? nvme_irq+0x50/0x50
> [0.76]  ? dev_warn+0x64/0x80
> [0.76]  nvme_reset_work+0x13de/0x1570
> [0.76]  ? __switch_to_asm+0x34/0x70
> [0.76]  ? __switch_to_asm+0x40/0x70
> [0.76]  ? _raw_spin_unlock_irq+0x15/0x30
> [0.76]  ? finish_task_switch+0x156/0x210
> [0.76]  process_one_work+0x20c/0x3d0
> [0.76]  worker_thread+0x216/0x400
> [0.76]  kthread+0x125/0x130
> [0.76]  ? process_one_work+0x3d0/0x3d0
> [0.76]  ? __kthread_bind_mask+0x60/0x60
> [0.76]  ret_from_fork+0x3a/0x50

Hello Alexandru,

What made you look at cgroups? In the above register dump I see that %rbx ==
NULL. I think that means that the queue pointer argument of
blk_queue_flag_set() is NULL. The NVMe initiator driver should never pass a
NULL pointer to blk_mq_quiesce_queue(). Please ask the NVMe driver maintainers
for their opinion on the linux-nvme mailing list.

Thanks,

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello, again.

On Wed, Apr 11, 2018 at 07:51:23AM -0700, Tejun Heo wrote:
> Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
> reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
> request queue removal and the block cgroup controller") created a
> regression where a request_queue can be destroyed with blkgs still
> attached.  The original report is..
> 
>  http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com

And looking at the change, it looks like the right thing we should
have done is caching @lock on the print_blkg side and when switching
locks make sure both locks are held.  IOW, do the following in
blk_cleanup_queue()

	spin_lock_irq(lock);
	if (q->queue_lock != &q->__queue_lock) {
		spin_lock(&q->__queue_lock);
		q->queue_lock = &q->__queue_lock;
		spin_unlock(&q->__queue_lock);
	}
	spin_unlock_irq(lock);

Otherwise, there can be two lock holders thinking they have exclusive
access to the request_queue.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

(cc'ing Bart)

On Wed, Apr 11, 2018 at 07:46:16AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > > Ah, that changed recently.  Can you please check out the current
> > > upstream git master?
> > > 
> > Just did, without my patch I see this crash:
> 
> lol I was looking at the old tree, so this is the fix for the new
> breakage introduced by the recent change.  Sorry about the confusion.
> Joseph, can you please take a look?

Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
request queue removal and the block cgroup controller") created a
regression where a request_queue can be destroyed with blkgs still
attached.  The original report is..

 http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > Ah, that changed recently.  Can you please check out the current
> > upstream git master?
> > 
> Just did, without my patch I see this crash:

lol I was looking at the old tree, so this is the fix for the new
breakage introduced by the recent change.  Sorry about the confusion.
Joseph, can you please take a look?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 07:20:19AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > > But we already do this through calling blkcg_exit_queue() from
> > > __blk_release_queue().  What's missing?
> > 
> > Hi,
> > 
> > It might be the jetlag but I can't see how you end up calling
> > blkcg_exit_queue() from __blk_release_queue().
> > 
> > As I see it the only way to reach blkcg_exit_queue() is from
> > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> > 
> > I suspect that I'm just fixing a corner case though and
> > the general case is what you describe or similar.
> 
> Ah, that changed recently.  Can you please check out the current
> upstream git master?
> 
> Thanks.
> 
Just did, without my patch I see this crash:

[0.75] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.16.0-ARCH+ #81
[0.75] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/204
[0.75] RIP: 0010:pi_init+0x23f/0x2f0
[0.75] RSP: :c9197d90 EFLAGS: 00010246
[0.75] RAX:  RBX: 0020 RCX: 0038
[0.75] RDX: 0001 RSI: 0001 RDI: 
[0.75] RBP: c9197e18 R08:  R09: 
[0.75] R10: eaeda600 R11: 88003b69f164 R12: 82e2d740
[0.75] R13:  R14:  R15: 
[0.75] FS:  () GS:88003e50() knlGS:
[0.75] CS:  0010 DS:  ES:  CR0: 80050033
[0.75] CR2: 0008 CR3: 02209001 CR4: 000606e0
[0.75] Call Trace:
[0.75]  pf_init+0x1db/0x3be
[0.75]  ? pcd_init+0x4e8/0x4e8
[0.75]  do_one_initcall+0x9e/0x1b0
[0.75]  ? do_early_param+0x97/0x97
[0.75]  kernel_init_freeable+0x259/0x2fd
[0.75]  ? rest_init+0xd0/0xd0
[0.75]  ? syscall_slow_exit_work+0x1c/0x160
[0.75]  kernel_init+0xe/0x100
[0.75]  ret_from_fork+0x3a/0x50
[0.75] Code: 75 6a 49 8b 06 48 8b 40 78 48 85 c0 74 08 4c 89 f7 e8 46 76 51 00 83 c3 01 3b 5d a8 7d 0d 49
[0.75] RIP: pi_init+0x23f/0x2f0 RSP: c9197d90
[0.75] CR2: 0008
[0.75] ---[ end trace 12004f267bb8bf7d ]---
[0.76] BUG: unable to handle kernel NULL pointer dereference at 01b4
[0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
[0.763350]
[0.76] PGD 0 P4D 0
[0.76] Oops:  [#2] PREEMPT SMP
[0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  
4.16.0-ARCH+ #81
[0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/20$
[0.76] Workqueue: nvme-reset-wq nvme_reset_work
[0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
[0.76] RSP: :c91bfcb0 EFLAGS: 00010246
[0.76] RAX: 88003b698000 RBX:  RCX: 
[0.76] RDX: 88003b698000 RSI: fff4 RDI: 001c
[0.76] RBP: c91bfcc0 R08:  R09: 
[0.76] R10: eaeaa980 R11: 814e0970 R12: 001c
[0.76] R13:  R14:  R15: 88003aad8010
[0.76] FS:  () GS:88003e40() 
knlGS:
[0.76] CS:  0010 DS:  ES:  CR0: 80050033
[0.76] CR2: 01b4 CR3: 02209001 CR4: 000606f0
[0.76] Call Trace:
[0.76]  blk_mq_quiesce_queue+0x23/0x80
[0.76]  nvme_dev_disable+0x34f/0x480
[0.76]  ? nvme_irq+0x50/0x50
[0.76]  ? dev_warn+0x64/0x80
[0.76]  nvme_reset_work+0x13de/0x1570
[0.76]  ? __switch_to_asm+0x34/0x70
[0.76]  ? __switch_to_asm+0x40/0x70
[0.76]  ? _raw_spin_unlock_irq+0x15/0x30
[0.76]  ? finish_task_switch+0x156/0x210
[0.76]  process_one_work+0x20c/0x3d0
[0.76]  worker_thread+0x216/0x400
[0.76]  kthread+0x125/0x130
[0.76]  ? process_one_work+0x3d0/0x3d0
[0.76]  ? __kthread_bind_mask+0x60/0x60
[0.76]  ret_from_fork+0x3a/0x50


With the patch the crash goes away,

Thanks,
../Alex

> -- 
> tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > But we already do this through calling blkcg_exit_queue() from
> > __blk_release_queue().  What's missing?
> 
> Hi,
> 
> It might be the jetlag but I can't see how you end up calling
> blkcg_exit_queue() from __blk_release_queue().
> 
> As I see it the only way to reach blkcg_exit_queue() is from
> blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> 
> I suspect that I'm just fixing a corner case though and
> the general case is what you describe or similar.

Ah, that changed recently.  Can you please check out the current
upstream git master?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Mon, Apr 09, 2018 at 03:09:38PM -0700, Tejun Heo wrote:
> (cc'ing Joseph as he worked on the area recently, hi!)
> 
> Hello,
> 
> On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> > The q->id is used as an index within the blkg_tree radix tree.
> > 
> > If the entry is not released before the blk_queue_ida id is reclaimed,
> > blkcg_init_queue() in a different driver that is later handed the same
> > id can fail, because the entry at that index within the radix tree
> > already exists.
> > 
> > Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
> > ---
> > v2: Added no-op for !CONFIG_BLK_CGROUP
> > 
> >  block/blk-cgroup.c | 2 +-
> >  block/blk-sysfs.c  | 4 ++++
> >  include/linux/blk-cgroup.h | 3 +++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 1c16694ae145..224e937dbb59 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
> >   *
> >   * Destroy all blkgs associated with @q.
> >   */
> > -static void blkg_destroy_all(struct request_queue *q)
> > +void blkg_destroy_all(struct request_queue *q)
> >  {
> > struct blkcg_gq *blkg, *n;
> >  
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index d00d1b0ec109..a72866458f22 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
> >  	if (q->bio_split)
> >  		bioset_free(q->bio_split);
> >  
> > +	spin_lock_irq(q->queue_lock);
> > +	blkg_destroy_all(q);
> > +	spin_unlock_irq(q->queue_lock);
> > +
> >  	ida_simple_remove(&blk_queue_ida, q->id);
> >  	call_rcu(&q->rcu_head, blk_free_queue_rcu);
> 
> But we already do this through calling blkcg_exit_queue() from
> __blk_release_queue().  What's missing?

Hi,

It might be the jetlag but I can't see how you end up calling
blkcg_exit_queue() from __blk_release_queue().

As I see it the only way to reach blkcg_exit_queue() is from
blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().

I suspect that I'm just fixing a corner case though and
the general case is what you describe or similar.

../Alex
> 
> Thanks.
> 
> -- 
> tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-09 Thread Tejun Heo
(cc'ing Joseph as he worked on the area recently, hi!)

Hello,

On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> The q->id is used as an index within the blkg_tree radix tree.
> 
> If the entry is not released before the blk_queue_ida id is reclaimed,
> blkcg_init_queue() in a different driver that is later handed the same
> id can fail, because the entry at that index within the radix tree
> already exists.
> 
> Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
> ---
> v2: Added no-op for !CONFIG_BLK_CGROUP
> 
>  block/blk-cgroup.c | 2 +-
>  block/blk-sysfs.c  | 4 ++++
>  include/linux/blk-cgroup.h | 3 +++
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1c16694ae145..224e937dbb59 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>   *
>   * Destroy all blkgs associated with @q.
>   */
> -static void blkg_destroy_all(struct request_queue *q)
> +void blkg_destroy_all(struct request_queue *q)
>  {
>   struct blkcg_gq *blkg, *n;
>  
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index d00d1b0ec109..a72866458f22 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
>  	if (q->bio_split)
>  		bioset_free(q->bio_split);
>  
> +	spin_lock_irq(q->queue_lock);
> +	blkg_destroy_all(q);
> +	spin_unlock_irq(q->queue_lock);
> +
>  	ida_simple_remove(&blk_queue_ida, q->id);
>  	call_rcu(&q->rcu_head, blk_free_queue_rcu);

But we already do this through calling blkcg_exit_queue() from
__blk_release_queue().  What's missing?

Thanks.

-- 
tejun