On 12/4/18 7:16 PM, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>> we queue the request up normally. However, the SCSI layer may have >>> already setup SG tables etc for this particular command. If we later >>> merge with this request, then the old tables are no longer valid. Once >>> we issue the IO, we only read/write the original part of the request, >>> not the new state of it. >>> >>> This causes data corruption, and is most often noticed with the file >>> system complaining about the just read data being invalid: >>> >>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: >>> comm dpkg-query: bad extra_isize 24937 (inode size 256) >>> >>> because most of it is garbage... >>> >>> This doesn't happen from the normal issue path, as we will simply defer >>> the request to the hardware queue dispatch list if we fail. Once it's on >>> the dispatch list, we never merge with it. >>> >>> Fix this from the direct issue path by flagging the request as >>> REQ_NOMERGE so we don't change the size of it before issue. >>> >>> See also: >>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>> >>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case >>> of 'none'") >>> Signed-off-by: Jens Axboe <ax...@kernel.dk> >>> >>> --- >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct >>> blk_mq_hw_ctx *hctx, >>> break; >>> case BLK_STS_RESOURCE: >>> case BLK_STS_DEV_RESOURCE: >>> + /* >>> + * If direct dispatch fails, we cannot allow any merging on >>> + * this IO. Drivers (like SCSI) may have set up permanent state >>> + * for this request, like SG tables and mappings, and if we >>> + * merge to it later on then we'll still only do IO to the >>> + * original part. >>> + */ >>> + rq->cmd_flags |= REQ_NOMERGE; >>> + >>> blk_mq_update_dispatch_busy(hctx, true); >>> __blk_mq_requeue_request(rq); >>> break; >>> >> >> Not sure it is enough to just mark it as NOMERGE, for example, driver >> may have setup the .special_vec for discard, and NOMERGE may not prevent >> request from entering elevator queue completely. Cause 'rq.rb_node' and >> 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. > >> So how about inserting this request via blk_mq_request_bypass_insert() >> in case that direct issue returns BUSY? Then it is invariant that >> any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here.
Here's a further limiting version. And we seriously need to clean up the direct issue paths, it's ridiculous. diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..3262d83b9e07 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, break; case BLK_STS_RESOURCE: case BLK_STS_DEV_RESOURCE: + /* + * If direct dispatch fails, we cannot allow any merging on + * this IO. Drivers (like SCSI) may have set up permanent state + * for this request, like SG tables and mappings, and if we + * merge to it later on then we'll still only do IO to the + * original part. + */ + rq->cmd_flags |= REQ_NOMERGE; + blk_mq_update_dispatch_busy(hctx, true); __blk_mq_requeue_request(rq); break; @@ -1727,6 +1736,18 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, return ret; } +/* + * Don't allow direct dispatch of anything but regular reads/writes, + * as some of the other commands can potentially share request space + * with data we need for the IO scheduler. If we attempt a direct dispatch + * on those and fail, we can't safely add it to the scheduler afterwards + * without potentially overwriting data that the driver has already written. + */ +static bool blk_rq_can_direct_dispatch(struct request *rq) +{ + return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE; +} + static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, blk_qc_t *cookie, @@ -1748,7 +1769,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, goto insert; } - if (q->elevator && !bypass_insert) + if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert)) goto insert; if (!blk_mq_get_dispatch_budget(hctx)) @@ -1810,6 +1831,9 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, struct request *rq = list_first_entry(list, struct request, queuelist); + if (!blk_rq_can_direct_dispatch(rq)) + break; + list_del_init(&rq->queuelist); ret = blk_mq_request_issue_directly(rq); if (ret != BLK_STS_OK) { -- Jens Axboe