On Thu, Jul 19, 2018 at 06:29:24PM +0200, h...@lst.de wrote:
> How do we get a fix into 4.18 at this part of the cycle? I think that
> is the most important priority right now.
Even if you were okay at this point to incorporate the concepts from
Bart's patch, it still looks like trouble for
On Thu, Jul 19, 2018 at 10:22:40AM -0600, Keith Busch wrote:
> On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > > Even some scsi drivers are susceptible to losing their requests with the
> > > reverted behavior: take
On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > Even some scsi drivers are susceptible to losing their requests with the
> > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > from its timeout
On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> Even some scsi drivers are susceptible to losing their requests with the
> reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> from its timeout handler. With the behavior everyone seems to want,
> a natural
On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote:
> > And we should never even hit the timeout handler for those as it
> > is rather pointless (although it looks we currently do..).
>
> I don't see why we'd expect to never hit timeout for at least some of
> these. It's not a stretch to
On Thu, Jul 19, 2018 at 03:19:04PM +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> > And there may be other drivers that don't want their completions
> > ignored, so breaking them again is also a step in the wrong direction.
> >
> > There are not that
On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> And there may be other drivers that don't want their completions
> ignored, so breaking them again is also a step in the wrong direction.
>
> There are not that many blk-mq drivers, so we can go through them all.
I think the point is
On Wed, Jul 18, 2018 at 10:39:36PM +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > > Jianchao's detailed
On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
>
> I'm not sure that's the right approach. I think it is the responsibility of
> the block layer to
On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> There are not that many blk-mq drivers, so we can go through them all.
I'm not sure that's the right approach. I think it is the responsibility of
the block layer to handle races between completion handling and timeout
handling and that this
On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
>
>
On Wed, 2018-07-18 at 22:39 +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> > The important bit is that we need to fix this issue quickly. We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> >
> > I'm not even sure SCSI has a
On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.
> [ ... ]
There may be other drivers that need the same protection the SCSI core needs
so I
On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully
On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> Of the two you mentioned, yours is preferable IMO. While I appreciate
> Jianchao's detailed analysis, it's hard to take a proposal seriously
> that so colourfully calls everyone else "dangerous" while advocating
> for silently losing
On Fri, Jul 13, 2018 at 11:03:18PM +0000, Bart Van Assche wrote:
> How do you want to go forward from here? Do you prefer the approach of the
> patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html),
> Jianchao's approach (https://marc.info/?l=linux-block=152950093831738) or
On Fri, 2018-07-13 at 12:47 -0600, Keith Busch wrote:
> On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote:
> > I think that behavior change even can trigger a kernel crash.
>
> I think we are past acknowledging issues exist with timeouts.
Hello Keith,
How do you want to go forward
On Fri, Jul 13, 2018 at 03:52:38PM +0000, Bart Van Assche wrote:
> No. What I'm saying is that a behavior change has been introduced in the block
> layer that was not documented in the patch description.
Did you read the changelog?
> I think that behavior change even can trigger a kernel crash.
On Fri, 2018-07-13 at 09:43 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 10:24:42PM +0000, Bart Van Assche wrote:
> > Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> > request completion was reported after request timeout processing had
> > started, completion
On Thu, Jul 12, 2018 at 10:24:42PM +0000, Bart Van Assche wrote:
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
>
On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> Hello Keith,
>
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
>
On 07/13/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
>> On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
>>> What prevents that a request finishes and gets reused after the
>>> blk_mq_req_expired() call has finished and before
On Thu, 2018-07-12 at 13:24 -0600, Keith Busch wrote:
> On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
> > What prevents that a request finishes and gets reused after the
> > blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> > called? Is this perhaps a
On Thu, Jul 12, 2018 at 06:16:12PM +0000, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> > /*
> > - * We marked @rq->aborted_gstate and waited for RCU. If there were
> > - * completions that we lost to, they would have finished and
> > - * updated
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> /*
> - * We marked @rq->aborted_gstate and waited for RCU. If there were
> - * completions that we lost to, they would have finished and
> - * updated @rq->gstate by now; otherwise, the completion path is
> - * now
On Wed, May 23, 2018 at 08:35:40AM -0600, Keith Busch wrote:
> On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote:
> > Let's consider the normal NVMe timeout code path:
> >
> > 1) one request is timed out;
> >
> > 2) controller is shutdown, this timed-out request is requeued from
> >
On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote:
> Let's consider the normal NVMe timeout code path:
>
> 1) one request is timed out;
>
> 2) controller is shutdown, this timed-out request is requeued from
> nvme_cancel_request(), but can't dispatch because queues are quiesced
>
> 3)
On 05/22/2018 06:17 PM, Christoph Hellwig wrote:
Hi Keith,
I like this series a lot. One comment that is probably close
to the big discussion in the thread:
switch (ret) {
case BLK_EH_HANDLED:
/*
+ * If the request is still in flight, the driver
On Tue, May 22, 2018 at 06:17:04PM +0200, Christoph Hellwig wrote:
> Hi Keith,
>
> I like this series a lot. One comment that is probably close
> to the big discussion in the thread:
>
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > /*
> > + * If the request is
On Tue, 2018-05-22 at 10:34 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote:
> > Please have another look at the current code that handles request timeouts
> > and completions. The current implementation guarantees that no double
> > completions can
On Tue, May 22, 2018 at 04:29:07PM +0000, Bart Van Assche wrote:
> Please have another look at the current code that handles request timeouts
> and completions. The current implementation guarantees that no double
> completions can occur but your patch removes essential aspects of that
>
On Tue, 2018-05-22 at 08:15 -0600, Keith Busch wrote:
> This shouldn't be introducing any new concurrent calls to
> __blk_mq_complete_request if they don't already exist. The timeout work
> calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
> the driver is claiming the
Hi Keith,
I like this series a lot. One comment that is probably close
to the big discussion in the thread:
> switch (ret) {
> case BLK_EH_HANDLED:
> /*
> + * If the request is still in flight, the driver is requesting
> + * blk-mq complete
On Tue, May 22, 2018 at 11:17 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote:
>> > At approximately the same time, you're saying the driver that returned
>> > EH_HANDLED may then call blk_mq_complete_request() in reference to the
>>
On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote:
> > At approximately the same time, you're saying the driver that returned
> > EH_HANDLED may then call blk_mq_complete_request() in reference to the
> > *old* request that it returned EH_HANDLED for, incorrectly completing
>
> Because
On Tue, May 22, 2018 at 11:01 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote:
>> OK, that still depends on driver's behaviour, even though it is true
>> for NVMe, you still have to audit other drivers about this sync
>> because it
On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote:
> OK, that still depends on driver's behaviour, even though it is true
> for NVMe, you still have to audit other drivers about this sync
> because it is required by your patch.
Okay, forget about nvme for a moment here. Please run this
On Tue, May 22, 2018 at 10:46 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote:
>> On Tue, May 22, 2018 at 10:20 PM, Keith Busch wrote:
>> > In the event the driver requests a normal completion, the
On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote:
> On Tue, May 22, 2018 at 10:20 PM, Keith Busch wrote:
> > In the event the driver requests a normal completion, the timeout work
> > releasing the last reference doesn't do a second completion: it only
>
>
On Tue, May 22, 2018 at 10:20 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote:
>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
>> > +static void
On 5/22/18 2:51 AM, Ming Lei wrote:
> On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote:
>> On May 21, 2018, at 9:47 PM, Ming Lei wrote:
>>>
On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> On 5/21/18 8:49 PM, Ming Lei wrote:
>> On Mon, May
On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
> > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> > struct request *rq, void
On Mon, May 21, 2018 at 11:29:06PM +0000, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > - __blk_mq_complete_request(req);
> > - break;
> > - case BLK_EH_RESET_TIMER:
> > /*
>
On Mon, May 21, 2018 at 09:51:19PM -0600, Jens Axboe wrote:
> On May 21, 2018, at 9:47 PM, Ming Lei wrote:
> >
> >> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> >>> On 5/21/18 8:49 PM, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith
On May 21, 2018, at 9:47 PM, Ming Lei wrote:
>
>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
>>> On 5/21/18 8:49 PM, Ming Lei wrote:
On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
This patch simplifies the timeout handling by relying
On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> On 5/21/18 8:49 PM, Ming Lei wrote:
> > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> >> This patch simplifies the timeout handling by relying on the request
> >> reference counting to ensure the iterator is operating
On 5/21/18 8:49 PM, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> This patch simplifies the timeout handling by relying on the request
>> reference counting to ensure the iterator is operating on an inflight
>
> The reference counting isn't free, what is the
On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> This patch simplifies the timeout handling by relying on the request
> reference counting to ensure the iterator is operating on an inflight
The reference counting isn't free, what is the real benefit in this way?
> and truly timed
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> void blk_mq_complete_request(struct request *rq)
> {
> struct request_queue *q = rq->q;
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q,
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from