Re: [dm-devel] RFC: one more time: SCSI device identification
On Wed, 2021-04-28 at 10:09 +1000, Erwin van Londen wrote:
> > On Tue, 2021-04-27 at 16:41 -0400, Ewan D. Milne wrote:
> > > On Tue, 2021-04-27 at 20:33 +, Martin Wilck wrote:
> > > > On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> > > > > There's no way to do that, in principle. Because there could be
> > > > > other I/Os in flight. You might (somehow) avoid retrying an I/O
> > > > > that got a UA until you figured out if something changed, but
> > > > > other I/Os can already have been sent to the target, or issued
> > > > > before you get to look at the status.
>
> If something happens on a storage side where a LUN gets its attributes
> changed (any, doesn't matter which one), a UA should be sent. Also, all
> outstanding I/Os on that LUN should return an abort, as the array can no
> longer warrant the validity of any I/O due to these changes, especially
> when parameters like reservations (PRs) are involved. If that does not
> happen on the array side, all bets are off, as the only way to get back
> in business is to start from scratch.

Perhaps an array might abort I/Os it has received in the Device Server
when something changes. I have no idea if most or any arrays actually do
that.

But what about I/O that has already been queued from the host to the
host bus adapter? I don't see how we can abort those I/Os properly.
Most high-performance HBAs have a queue of commands and a queue of
responses; there could be lots of commands queued before we manage to
notice an interesting status. And AFAIK there is no conditional
mechanism that could hold them off (and they could be in-flight on the
wire anyway).

I get what you are saying about what SAM describes, I just don't see how
we can guarantee we don't send any further commands after the status
with the UA is sent back, before we can understand what happened.

-Ewan

> > > Right. But in practice, a WWID change will hardly happen under full
> > > IO load.
> > > The storage side will probably have to block IO while this
> > > happens, at least for a short time period. So blocking and quiescing
> > > the queue upon a UA might still work, most of the time. Even if we
> > > were too late already, the sooner we stop the queue, the better.
>
> I think in most cases when something happens on an array side you will
> see I/Os being aborted. That might be a good time to start doing TURs,
> and if these come back OK, do a new inquiry. From a host side there is
> only so much you can do.
>
> > > The current algorithm in multipath-tools needs to detect a path
> > > going down and being reinstated. The time interval during which a
> > > WWID change will go unnoticed is one or more path checker intervals,
> > > typically on the order of 5-30 seconds. If we could decrease this
> > > interval to a sub-second or even millisecond range by blocking the
> > > queue in the kernel quickly, we'd have made a big step forward.
> >
> > Yes, and in many situations this may help. But in the general case
> > we can't protect against a storage array misconfiguration, where
> > something like this can happen. So I worry about people believing
> > the host software will protect them against a mistake, when we
> > can't really do that.
>
> My thought exactly.
>
> > All it takes is one I/O (a discard) to make a thorough mess of the
> > LUN.
> >
> > -Ewan
> >
> > > Regards
> > > Martin

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH v2 1/2] scsi: convert scsi_result_to_blk_status() to inline
On 4/29/21 1:33 PM, Martin Wilck wrote:
> On Thu, 2021-04-29 at 09:20 -0700, Bart Van Assche wrote:
>> On 4/29/21 8:50 AM, mwi...@suse.com wrote:
>>> This makes it possible to use scsi_result_to_blk_status() from
>>> code that shouldn't depend on scsi_mod (e.g. device mapper).
>>
>> I think that's the wrong reason to make a function inline. Please
>> consider moving scsi_result_to_blk_status() into one of the block
>> layer source code files that already deals with SG I/O, e.g.
>> block/scsi_ioctl.c.
>
> scsi_ioctl.c, are you certain? scsi_result_to_blk_status() is an
> important part of the block/scsi interface... You're right that this
> function is not a prime candidate for inlining, but I'm still
> wondering where it belongs if we don't.

The block/scsi_ioctl.c file is included in the kernel if and only if
BLK_SCSI_REQUEST is enabled, and that Kconfig symbol controls nothing
other than whether block/scsi_ioctl.c is built. Additionally, the
following occurs in the SCSI Kconfig file:

config SCSI
	tristate "SCSI device support"
	[ ... ]
	select BLK_SCSI_REQUEST

In other words, block/scsi_ioctl.c is built unconditionally if SCSI is
enabled. Adding the scsi_result_to_blk_status() function to
block/scsi_ioctl.c would increase the size of kernels that enable bsg,
ide, the SCSI target code or nfsd, but not the SCSI initiator code. If
the latter is a concern, how about moving scsi_result_to_blk_status()
into a new file in the block directory?

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
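For illustration only: if the function did move to a new file in block/, a dedicated Kconfig symbol could keep it out of kernels that don't need it. The symbol name, file name, and Makefile line below are invented for this sketch, not taken from the thread:

```
# Hypothetical symbol guarding only the new file.
config BLK_SCSI_STATUS
	bool

# block/Makefile (sketch):
#   obj-$(CONFIG_BLK_SCSI_STATUS) += blk-scsi-status.o

# drivers/scsi/Kconfig would then select it (sketch):
config SCSI
	tristate "SCSI device support"
	select BLK_SCSI_STATUS
```

Only configs that select the hypothetical symbol (here, the SCSI initiator) would pay the size cost, addressing Bart's concern about bsg/ide/target/nfsd-only kernels.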
Re: [dm-devel] dm: dm_blk_ioctl(): implement failover for SG_IO on dm-multipath
On Thu, Apr 29 2021 at 2:02am -0400, Hannes Reinecke wrote:

> On 4/28/21 9:54 PM, Mike Snitzer wrote:
> > On Thu, Apr 22 2021 at 4:21P -0400, mwi...@suse.com wrote:
> >
> >> From: Martin Wilck
> >>
> >> In virtual deployments, SCSI passthrough over dm-multipath devices is
> >> a common setup. The qemu "pr-helper" was specifically invented for it.
> >> I believe that this is the most important real-world scenario for
> >> sending SG_IO ioctls to device-mapper devices.
> >>
> >> In this configuration, guests send SCSI IO to the hypervisor in the
> >> form of SG_IO ioctls issued by qemu. But on the device-mapper level,
> >> these SCSI ioctls aren't treated like regular IO. Until commit
> >> 2361ae595352 ("dm mpath: switch paths in dm_blk_ioctl() code path"),
> >> no path switching was done at all. Worse though, if an SG_IO call
> >> fails because of a path error, dm-multipath doesn't retry the IO on
> >> another path; rather, the failure is passed back to the guest, and
> >> paths are not marked as faulty. This is in stark contrast with regular
> >> block IO of guests on dm-multipath devices, and certainly comes as a
> >> surprise to users who switch to SCSI passthrough in qemu. In general,
> >> users of dm-multipath devices would probably expect failover to work
> >> at least in a basic way.
> >>
> >> This patch fixes this by taking a special code path for SG_IO on
> >> request-based device mapper targets. Rather than just choosing a
> >> single path, sending the IO to it, and failing to the caller if the
> >> IO on the path failed, it retries the same IO on another path for
> >> certain error codes, using the same logic as blk_path_error() to
> >> determine if a retry would make sense for the given error code.
> >> Moreover, it sends a message to the multipath target to mark the path
> >> as failed.
> >>
> >> I am aware that this looks sort of hack-ish. I'm submitting it here as
> >> an RFC, asking for guidance how to reach a clean implementation that
> >> would be acceptable in mainline.
> >> I believe that it fixes an important problem. Actually, it fixes the
> >> qemu SCSI-passthrough use case on dm-multipath, which is
> >> non-functional without it.
> >>
> >> One problem remains open: if all paths in a multipath map are failed,
> >> normal multipath IO may switch to queueing mode (depending on
> >> configuration). This isn't possible for SG_IO, as SG_IO requests can't
> >> easily be queued like regular block I/O. Thus in the "no path" case,
> >> the guest will still see an error. If anybody can conceive of a
> >> solution for that, I'd be interested.
> >>
> >> Signed-off-by: Martin Wilck
> >> ---
> >>  block/scsi_ioctl.c     |   5 +-
> >>  drivers/md/dm.c        | 134 -
> >>  include/linux/blkdev.h |   2 +
> >>  3 files changed, 137 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> >> index 6599bac0a78c..bcc60552f7b1 100644
> >> --- a/block/scsi_ioctl.c
> >> +++ b/block/scsi_ioctl.c
> >> @@ -279,8 +279,8 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
> >>  	return ret;
> >>  }
> >>
> >> -static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >> -		 struct sg_io_hdr *hdr, fmode_t mode)
> >> +int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >> +	  struct sg_io_hdr *hdr, fmode_t mode)
> >>  {
> >>  	unsigned long start_time;
> >>  	ssize_t ret = 0;
> >> @@ -369,6 +369,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >>  	blk_put_request(rq);
> >>  	return ret;
> >>  }
> >> +EXPORT_SYMBOL_GPL(sg_io);
> >>
> >>  /**
> >>   * sg_scsi_ioctl -- handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >> index 50b693d776d6..443ac5e5406c 100644
> >> --- a/drivers/md/dm.c
> >> +++ b/drivers/md/dm.c
> >> @@ -29,6 +29,8 @@
> >>  #include
> >>  #include
> >>  #include
> >> +#include
> >> +#include
> >>
> >>  #define DM_MSG_PREFIX "core"
> >
> > Ngh... not loving this at all.
> > But here is a patch (that I didn't compile test) that should be
> > folded in; still have to think about how this could be made cleaner
> > in general though.
> >
> > Also, dm_sg_io_ioctl should probably be in dm-rq.c (maybe even
> > dm-mpath.c!?)
>
> I fully share your sentiments; this just feels so awkward, having to
> redo the logic in scsi_cmd_ioctl().
>
> My original idea was to _use_ scsi_cmd_ioctl() directly from
> dm_blk_ioctl:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..007ff981f888 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -567,6 +567,9 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
>  	struct mapped_device *md = bdev->bd_disk->private_data;
>  	int r, srcu_idx;
>
> +	if (cmd == SG_IO)
> +		return scsi_cmd_blk_ioctl(bdev, mode, cmd, arg);
> +
>  	r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
>  	if (r < 0)
>  		goto out;
>
> But that
[dm-devel] [git pull] device mapper changes for 5.13
Hi Linus,

Some changes were slightly late and their inclusion made later because I
had a worse-than-expected reaction to my 2nd covid shot the past couple
days. Nothing crazy here but wanted to explain why some commits are
recent.

The following changes since commit 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a:

  dm ioctl: fix out of bounds array access when no devices (2021-03-26 14:51:50 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.13/dm-changes

for you to fetch changes up to ca4a4e9a55beeb138bb06e3867f5e486da896d44:

  dm raid: remove unnecessary discard limits for raid0 and raid10 (2021-04-30 14:38:37 -0400)

Please pull, thanks.
Mike

- Improve scalability of DM's device hash by switching to rbtree

- Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and
  allow filtering based on name or UUID prefix.

- Various small fixes for typos, warnings, unused function, or
  needlessly exported interfaces.

- Remove needless request_queue NULL pointer checks in DM thin and
  cache targets.

- Remove unnecessary loop in DM core's __split_and_process_bio().

- Remove DM core's dm_vcalloc() and just use kvcalloc or
  kvmalloc_array instead (depending whether zeroing is useful).

- Fix request-based DM's double free of blk_mq_tag_set in device
  remove after table load fails.

- Improve DM persistent data performance on non-x86 by fixing packed
  structs to have a stated alignment. Also remove needless extra work
  from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON()
  that caused duplicate checksum calculation.

- Fix missing goto in DM integrity's bitmap_flush_interval error
  handling.

- Add "reset_recalculate" feature flag to DM integrity.

- Improve DM integrity by leveraging discard support to avoid needless
  re-writing of metadata and also use discard support to improve hash
  recalculation.
- Fix race with DM raid target's reshape and MD raid4/5/6 resync that
  resulted in inconsistent reshape state during table reloads.

- Update DM raid target to remove unnecessary discard limits for raid0
  and raid10 now that MD has optimized discard handling for both raid
  levels.

Benjamin Block (1):
      dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails

Bhaskar Chowdhury (1):
      dm ebs: fix a few typos

Christoph Hellwig (1):
      dm: unexport dm_{get,put}_table_device

Gustavo A. R. Silva (1):
      dm raid: fix fall-through warning in rs_check_takeover() for Clang

Heinz Mauelshagen (1):
      dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences

JeongHyeon Lee (1):
      dm verity: allow only one error handling mode

Jiapeng Chong (2):
      dm persistent data: remove unused return from exit_shadow_spine()
      dm clone metadata: remove unused function

Joe Thornber (4):
      dm space map disk: remove redundant calls to sm_disk_get_nr_free()
      dm btree spine: remove paranoid node_check call in node_prep_for_write()
      dm persistent data: packed struct should have an aligned() attribute too
      dm space map common: fix division bug in sm_ll_find_free_block()

Julia Lawall (1):
      dm writecache: fix flexible_array.cocci warnings

Matthew Wilcox (Oracle) (1):
      dm: replace dm_vcalloc()

Mike Snitzer (1):
      dm raid: remove unnecessary discard limits for raid0 and raid10

Mikulas Patocka (8):
      dm: remove useless loop in __split_and_process_bio
      dm ioctl: replace device hash with red-black tree
      dm ioctl: return UUID in DM_LIST_DEVICES_CMD result
      dm ioctl: filter the returned values according to name or uuid prefix
      dm integrity: add the "reset_recalculate" feature flag
      dm integrity: don't re-write metadata if discarding same blocks
      dm integrity: increase RECALC_SECTORS to improve recalculate speed
      dm integrity: use discard support when recalculating

Tian Tao (1):
      dm integrity: fix missing goto in bitmap_flush_interval error handling

Xu Wang (2):
      dm thin: remove needless request_queue NULL pointer check
      dm cache: remove needless request_queue NULL pointer checks

 drivers/md/dm-cache-target.c    |   2 +-
 drivers/md/dm-clone-metadata.c  |   6 -
 drivers/md/dm-ebs-target.c      |   6 +-
 drivers/md/dm-integrity.c       |  85 ---
 drivers/md/dm-ioctl.c           | 294 ++-
 drivers/md/dm-raid.c            |  44 ++--
 drivers/md/dm-rq.c              |   2 +
 drivers/md/dm-snap-persistent.c |   6 +-
 drivers/md/dm-snap.c            |   5 +-
 drivers/md/dm-table.c           |  30 +--
 drivers/md/dm-thin.c            |
[dm-devel] Raid1 and Raid10 Discard Limit Fixups for 5.13 Merge Window
Hi Mike,

The Raid10 block discard performance patchset from Xiao Ni has reached
mainline this morning in the following commits:

254c271da071 md/raid10: improve discard request for far layout
d30588b2731f md/raid10: improve raid10 discard request
f2e7e269a752 md/raid10: pull the code that wait for blocked dev into one function
c2968285925a md/raid10: extend r10bio devs to raid disks
cf78408f937a md: add md_submit_discard_bio() for submitting discard bio

I was wondering, are you planning to resubmit your patches to fixup
discard limits for raid1 and raid10 this merge cycle?

commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
Author: Mike Snitzer
Date: Thu Sep 24 13:14:52 2020 -0400
Subject: dm raid: fix discard limits for raid1 and raid10
Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512

commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
Author: Mike Snitzer
Date: Thu Sep 24 16:40:12 2020 -0400
Subject: dm raid: remove unnecessary discard limits for raid10
Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28

Thanks,
Matthew

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LKP] Re: [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail
On 4/28/21 10:03 PM, Theodore Ts'o wrote:
> (Hmm, why did you cc linux-km on this report? I would have thought
> dm-devel would have made more sense?)
>
> On Tue, Apr 27, 2021 at 04:15:39PM +0800, kernel test robot wrote:
>> FYI, we noticed the following commit (built with gcc-9):
>>
>> commit: 21175ca434c5d49509b73cf473618b01b0b85437 ("ext4: make prefetch_block_bitmaps default")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> in testcase: mdadm-selftests
>> version: mdadm-selftests-x86_64-5d518de-1_20201008
>> with following parameters:
>>
>> 	disk: 1HDD
>> 	test_prefix: 01r1
>> 	ucode: 0x21
>
> So this failure makes no sense to me. Looking at the kmesg failure
> logs, it's failing in the md layer:
>
> kern :info : [ 99.775514] md/raid1:md0: not clean -- starting background reconstruction
> kern :info : [ 99.783372] md/raid1:md0: active with 3 out of 4 mirrors
> kern :info : [ 99.789735] md0: detected capacity change from 0 to 37888
> kern :info : [ 99.796216] md: resync of RAID array md0
> kern :crit : [ 99.900450] md/raid1:md0: Disk failure on loop2, disabling device.
> md/raid1:md0: Operation continuing on 2 devices.
> kern :crit : [ 99.918281] md/raid1:md0: Disk failure on loop1, disabling device.
> md/raid1:md0: Operation continuing on 1 devices.
> kern :info : [ 100.835833] md: md0: resync interrupted.
> kern :info : [ 101.852898] md: resync of RAID array md0
> kern :info : [ 101.858347] md: md0: resync done.
> user :notice: [ 102.109684] /lkp/benchmarks/mdadm-selftests/tests/01r1fail... FAILED - see /var/tmp/01r1fail.log and /var/tmp/fail01r1fail.log for details
>
> The referenced commit just turns on block bitmap prefetching in ext4.
> This should not cause md to fail; if so, that's an md bug, not an ext4
> bug. There should not be anything that the file system is doing that
> would cause the kernel to think there is a disk failure.
> By the way, the reproduction instructions aren't working currently:
>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml  # job file is attached in this email
>
> This fails because lkp is trying to apply a patch which does not apply
> with the current version of the md tools.

Hi Ted,

Thanks for the feedback. Yes, there's a patch that has already been
merged into mdadm; we have removed it from our code.

>> bin/lkp split-job --compatible job.yaml
>> bin/lkp run compatible-job.yaml
>
> And the current versions of lkp don't generate a compatible-job.yaml
> file when you run "lkp split-job --compatible"; instead it generates a
> new yaml file with a set of random characters to generate a unique
> name. (What in Multics parlance would be called a "shriek name"[1] :-)

We have updated the steps to avoid misunderstanding.

> Since I was having trouble running the reproduction, could you send
> the /var/tmp/*fail.logs so we could have a bit more insight into what
> is going on?

I attached the log file for your reference. BTW, the test is from
https://github.com/neilbrown/mdadm/blob/master/tests/01r1fail, you may
want to run it directly.

Best Regards,
Rong Chen

> Thanks!
>
> - Ted

___
LKP mailing list -- l...@lists.01.org
To unsubscribe send an email to lkp-le...@lists.01.org

+ . /lkp/benchmarks/mdadm-selftests/tests/01r1fail
++ mdadm -CR /dev/md0 -l1 -n4 /dev/loop0 /dev/loop1 /dev/loop2 missing
++ rm -f /var/tmp/stderr
++ case $* in
++ case $* in
++ for args in $*
++ [[ -CR =~ /dev/ ]]
++ for args in $*
++ [[ /dev/md0 =~ /dev/ ]]
++ [[ /dev/md0 =~ md ]]
++ for args in $*
++ [[ -l1 =~ /dev/ ]]
++ for args in $*
++ [[ -n4 =~ /dev/ ]]
++ for args in $*
++ [[ /dev/loop0 =~ /dev/ ]]
++ [[ /dev/loop0 =~ md ]]
++ /lkp/benchmarks/mdadm-selftests/mdadm --zero /dev/loop0
mdadm: Unrecognised md component device - /dev/loop0
++ for args in $*
++ [[ /dev/loop1 =~ /dev/ ]]
++ [[ /dev/loop1 =~ md ]]
++ /lkp/benchmarks/mdadm-selftests/mdadm --zero /dev/loop1
mdadm: Unrecognised md component device - /dev/loop1
++ for args in $*
++ [[ /dev/loop2 =~ /dev/ ]]
++ [[ /dev/loop2 =~ md ]]
++ /lkp/benchmarks/mdadm-selftests/mdadm --zero /dev/loop2
mdadm: Unrecognised md component device - /dev/loop2
++ for args in $*
++ [[ missing =~ /dev/ ]]
++ /lkp/benchmarks/mdadm-selftests/mdadm --quiet -CR /dev/md0 -l1 -n4 /dev/loop0 /dev/loop1 /dev/loop2 missing --auto=yes
++ rv=0
++ case $* in
++ cat /var/tmp/stderr
++ return 0
++ check resync
++ case $1 in
++ cnt=5
++ grep -sq resync /proc/mdstat
++ mdadm /dev/md0 --fail /dev/loop2
++ rm -f /var/tmp/stderr
++ case $* in
++ case $* in
++ /lkp/benchmarks/mdadm-selftests/mdadm --quiet /dev/md0 --fail /dev/loop2
++ rv=0
++ case $* in
++ cat /var/tmp/stderr
++ return 0
++ check resync
++ case $1 in
++ cnt=5
++ grep -sq resync /proc/mdstat
++ mdadm /dev/md0 --fail /dev/loop1
++ rm -f /var/tmp/stderr
++