Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.
On 04/28/2017 01:05 AM, NeilBrown wrote: When an array is created the content is not initialized, so it could have remnants of an old filesystem or md array etc on it. udev will see this and might try to activate it, which is almost certainly not what is wanted. So create a mechanism for mdadm to communicate with udev to tell it that the device isn't ready. This mechanism is the existance of a file /run/mdadm/created-mdXXX where mdXXX is the md device name. When creating an array, mdadm will create the file. A new udev rule file, 01-md-raid-creating.rules, will detect the precense of thst file and set ENV{SYSTEMD_READY}="0". This is fairly uniformly used to suppress actions based on the contents of the device. Signed-off-by: NeilBrown --- Assemble.c | 2 +- Build.c | 2 +- Create.c| 9 +++- Incremental.c | 4 ++-- Makefile| 4 ++-- lib.c | 29 + mdadm.h | 4 +++- mdopen.c| 52 - udev-md-raid-creating.rules | 7 ++ 9 files changed, 86 insertions(+), 27 deletions(-) create mode 100644 udev-md-raid-creating.rules Applied! I like this solution much better, even though I am not in love with the giant usleep() call. Would be nice to find a better way around that. Sorry it took so long to get back to you on this, last week was a mess. Thanks, Jes -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag
On Tue, 2017-05-02 at 09:23 +0200, h...@lst.de wrote: > On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote: > > Or, another options is use bdev_write_zeroes_sectors() to determine when > > dev_attrib->unmap_zeroes_data should be set. > > Yes, that in combination with your patch to use bdev_write_zeroes_sectors > for zeroing from write same seems like the right fix. The larger target/iblock conversion patch looks like post v4.12 material at this point, so to avoid breakage wrt to existing LBPRZ behavior, I'll plan to push the following patch post -rc1. diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c index d2f089c..e7caf78 100644 --- a/drivers/target/target_core_device.c +++ b/drivers/target/target_core_device.c @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib, attrib->unmap_granularity = q->limits.discard_granularity / block_size; attrib->unmap_granularity_alignment = q->limits.discard_alignment / block_size; - attrib->unmap_zeroes_data = 0; + attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors); return true; } EXPORT_SYMBOL(target_configure_unmap_from_queue); -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
On 05/02/2017 07:40 AM, Peter Rajnoha wrote: On 05/01/2017 06:35 AM, NeilBrown wrote: On Fri, Apr 28 2017, Peter Rajnoha wrote: Then mdadm opens the devive, clears any old content/signatures the data area may contain, then closes it - this generates the third event - which is the "synthetic change" event (as a result of the inotify WATCH rule). And this one would drop the "not initialized" flag in udev db and the scans in udev are now enabled. Nope, it would be incorrect for mdadm to clear any old content. Sometimes people want to ignore old content. Sometimes they very definitely want to use it. It would be wrong for any code to try to guess what is wanted. The mdadm is not going to guess - it can have an option to enable/disable the wiping on demand directly on command line (which is also what is actually done in LVM). I know the anaconda team keeps pushing for the nonsense of being able to wipe drives on creation. It is wrong, it is broken, and it is not going to happen. Otherwise, if mdadm is not going to wipe/initialize the device itself, then it needs to wait for the external tool to do it (based on external configuration or on some manual wipefs-like call). So the sequence would be: 1) mdadm creating the device 2) mdadm setting up the device, marking it as "not initialized yet" 4a) mdadm waiting for the external decision to be made about wiping part 4b) external tool doing the wiping (or not) based on configuration or user's will 5) mdadm continuing and finishing when the wiping part is complete I expect mdadm to return only if the device is properly initialized otherwise it's harder for other tools called after mdadm to start working with the device - they need to poll for the state laboriously and check for readiness. What defines readiness? Some believe a raid1 array must be fully assembled with all members, other setups are happy to have one running drive in place. 4a/4b in your list here once again has no place in mdadm. Please kindly tell the anaconda team to go back and handle this properly instead. Jes -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.
On 04/28/2017 05:28 AM, Peter Rajnoha wrote: On 04/28/2017 07:05 AM, NeilBrown wrote: When an array is created the content is not initialized, so it could have remnants of an old filesystem or md array etc on it. udev will see this and might try to activate it, which is almost certainly not what is wanted. So create a mechanism for mdadm to communicate with udev to tell it that the device isn't ready. This mechanism is the existance of a file /run/mdadm/created-mdXXX where mdXXX is the md device name. When creating an array, mdadm will create the file. A new udev rule file, 01-md-raid-creating.rules, will detect the precense of thst file and set ENV{SYSTEMD_READY}="0". This is fairly uniformly used to suppress actions based on the contents of the device. The scans in udev are primarily directed by blkid call which detects the signatures and based on this information various other udev rules fire. The blkid as well as wipefs uses common libblkid library to detect these signatures - is mdadm going to use libblkid to wipe the signatures on MD device initialization or is it relying on external tools to do this? How is mdadm actually initializing/wiping the newly created MD device? mdadm doesn't wipe data and it isn't supposed to. Being able to create an array from drives with existing data is a feature. It is the responsibility of the system administrator to handle drives with existing data, in the same way the administrator is expected to handle insertion of USB drives or external drives being powered on. Jes -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v2 10/10] dm-zoned: Drive-managed zoned block device target
On Tue, 2017-05-02 at 02:53 +0900, damien.lem...@wdc.com wrote: > +static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink, > + struct shrink_control *sc) > +{ > + struct dmz_target *dmz = > + container_of(shrink, struct dmz_target, mblk_shrinker); > + > + return atomic_read(&dmz->nr_mblks); > +} Hello Damien, dmz_mblock_shrinker_count() probably should return the following value since dmz_shrink_mblock_cache() won't free more than the this number of elements: max(atomic_read(&dmz->nr_mblks) - dmz->min_nr_mblks, 0) But since v2 is IMHO good enough to be merged, for the whole series: Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
I'm sorry, but I didn't read all your words. You seemed to be telling me about extra complexity in udev, and extra complexity that you think belongs in mdadm, which together might achieve your vision for how things should work. But to me, complexity is the enemy. Give me "simple" any day. NeilBrown On Tue, May 02 2017, Peter Rajnoha wrote: > On 05/01/2017 06:35 AM, NeilBrown wrote: >> On Fri, Apr 28 2017, Peter Rajnoha wrote: >> >>> On 04/28/2017 05:55 AM, NeilBrown wrote: On Wed, Apr 26 2017, Peter Rajnoha wrote: > On 04/20/2017 11:35 PM, NeilBrown wrote: >> If we wanted an more permanent udev rule, we would need to record the >> devices that should be ignored in the filesystem somewhere else. >> Maybe in /run/mdadm. >> e.g. >> >> KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", >> ENV{SYSTEMD_READY}="0" >> >> Then we could have something like the following (untested) in mdadm. >> Does that seem more suitable? >> > > Yes, please, if possible, go for a permanent udev rule surely - this > will make it much easier for other foreign tools to hook in properly if > needed and it would also be much easier to debug. I'm leaning towards the files-in-/run/mdadm approach too. I'll make a proper patch. > > But, wouldn't it be better if we could just pass this information ("not > initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md > driver in kernel could add the variable to the uevent it generates which > userspace udev rules could check for easily. This way, we don't need to > hassle with creating files in filesystem and the information would be > directly a part of the uevent the md kernel driver generates (so > directly accessible in udev rules too). Also, possibly adding more > variables for other future scenarios if needed. When would we clear the "not initialised yet" flag in the kernel, and how? And would that be enough. >>> >>> The flag wouldn't be stored in kernel, md kernel driver would just pass >>> the flag with the uevent as it received in with ioctl/sysfs request to >>> create a new dev. The udev in userspace would handle the state >>> transition then from "flagged as not-initialized" state to "initilized" >>> by checking the sequence of events as they come. >>> >>> We should have this sequence I assume: >>> >>> 1) "add" (creating dev, not usable yet) >>> 2) "change" (activated dev, but not initialized yet) >>> 3) "synthetic change" (after wiping the dev and closing it, the WATCH >>> rule fires) >>> >> >> "Should" is arguable, but there are no guarantees of this sequence. >> >> A particular cause of irregular sequencing is when an array is assembled >> by the initrd, then after the real root is mounted, something runs >> udevadm trigger --action=add >> (e.g. systemd-udev-triggger.service). >> >> In that case, 'add' is the first and only message that the >> full-root-available udev sees, so it is not safe to ignore the array as >> "not usable yet". >> >> > > As for initrd case, there's OPTIONS+="db_persist" rule which is already > used by dracut to keep udev db records when switching from initrd to > root fs and replacing the udev daemon instance (other devices which do > not have this option set will have udev db records cleared when switched > to root fs). Not sure why it's not documented in udev yet (I already > asked for that years ago), but the support is definitely there. So > there's already a way to keep these records and to not lose track of the > current state when switching from initrd. This makes it still possible > to detect whether the synthetic "add" (or "change") event is sent before > or after proper device initialization and to make it possible to react > on the trigger (with synthetic uevents generated) still while ignoring > the "add" event if that comes from the kernel before the device is > initialized. And even if we're switching from initrd to root fs and > restarting udev daemon. > When mdadm creates an md array, at least 3 uevents get generated. The first is generated when the md device object is created, either by opening the /dev/mdXX file, or by writing magic to somewhere in sysfs. The second is generated when the array is started and the content is visible. The third is generated when mdadm closes the file descriptor. It opens /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it. Because udev uses inotify to watch for a 'close for a writable file descriptor', this generates another event. We need to ensure that none of these cause udev to run anything that inspects the contents of the array. Of the three, only the second one is directly under the control of the md module, so only that one can add an environment variable. It might be possible to avoid the last one (by not opening for wri
[dm-devel] [PATCH] dm-bufio: make the parameter "retain_bytes" unsigned long
Change the type of the parameter "retain_bytes" from unsigned to unsigned long, so that on 64-bit machines the user can set more than 4GiB of data to be retained. Also, change the type of the variable "count" in the function "__evict_old_buffers" to unsigned long. The assignment "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];" could result in unsigned long to unsigned overflow and that could result in buffers not being freed when they should. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org --- drivers/md/dm-bufio.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) Index: linux-2.6/drivers/md/dm-bufio.c === --- linux-2.6.orig/drivers/md/dm-bufio.c +++ linux-2.6/drivers/md/dm-bufio.c @@ -228,7 +228,7 @@ static DEFINE_SPINLOCK(param_spinlock); * Buffers are freed after this timeout */ static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS; -static unsigned dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES; +static unsigned long dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES; static unsigned long dm_bufio_peak_allocated; static unsigned long dm_bufio_allocated_kmem_cache; @@ -1597,9 +1597,9 @@ static bool __try_evict_buffer(struct dm return true; } -static unsigned get_retain_buffers(struct dm_bufio_client *c) +static unsigned long get_retain_buffers(struct dm_bufio_client *c) { -unsigned retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes); +unsigned long retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes); return retain_bytes >> (c->sectors_per_block_bits + SECTOR_SHIFT); } @@ -1610,7 +1610,7 @@ static unsigned long __scan(struct dm_bu struct dm_buffer *b, *tmp; unsigned long freed = 0; unsigned long count = nr_to_scan; - unsigned retain_target = get_retain_buffers(c); + unsigned long retain_target = get_retain_buffers(c); for (l = 0; l < LIST_SIZE; l++) { list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) { @@ -1833,8 +1833,8 @@ static bool older_than(struct dm_buffer static void __evict_old_buffers(struct dm_bufio_client *c, unsigned long age_hz) { struct dm_buffer *b, *tmp; - unsigned retain_target = get_retain_buffers(c); - unsigned count; + unsigned long retain_target = get_retain_buffers(c); + unsigned long count; LIST_HEAD(write_list); dm_bufio_lock(c); @@ -1994,7 +1994,7 @@ MODULE_PARM_DESC(max_cache_size_bytes, " module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds"); -module_param_named(retain_bytes, dm_bufio_retain_bytes, uint, S_IRUGO | S_IWUSR); +module_param_named(retain_bytes, dm_bufio_retain_bytes, ulong, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(retain_bytes, "Try to keep at least this many bytes cached in memory"); module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR); -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [git pull] device mapper post-merge changes for 4.12
Hi Linus, Here are some changes from Christoph that needed to be rebased ontop of changes that were already merged into the 'tags/for-4.12/dm-changes'. In addition, these changes depend on the 'for-4.12/block' changes that you've already merged. So once you've pulled 'tags/for-4.12/dm-changes' please then consider pulling these changes. The following changes since commit 7e25a7606147bfe29a7421ff2cb332b07d3cee3a: Merge branch 'dm-4.12' into dm-4.12-post-merge (2017-05-01 18:18:04 -0400) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-4.12/dm-post-merge-changes for you to fetch changes up to 412445acb6cad4cef026daae37c4765fb9942c60: dm: introduce a new DM_MAPIO_KILL return value (2017-05-01 18:19:03 -0400) Please pull, thanks. Mike - Cleanups to request-based DM and DM multipath from Christoph that prepare for his block core error code type checking improvements. Christoph Hellwig (3): dm mpath: merge do_end_io into multipath_end_io dm rq: change ->rq_end_io calling conventions dm: introduce a new DM_MAPIO_KILL return value drivers/md/dm-mpath.c | 54 +-- drivers/md/dm-rq.c| 29 --- drivers/md/dm-target.c| 2 +- include/linux/device-mapper.h | 2 ++ 4 files changed, 39 insertions(+), 48 deletions(-) -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [git pull] device mapper changes for 4.12
Hi Linus, The following changes since commit c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201: Linux 4.11-rc1 (2017-03-05 12:59:56 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-4.12/dm-changes for you to fetch changes up to 390020ad2af9ca04844c4f3b1f299ad8746d84c8: dm bufio: check new buffer allocation watermark every 30 seconds (2017-05-01 15:21:42 -0400) Please pull, thanks. Mike - A major update for DM cache that reduces the latency for deciding whether blocks should migrate to/from the cache. The bio-prison-v2 interface supports this improvement by enabling direct dispatch of work to workqueues rather than having to delay the actual work dispatch to the DM cache core. So the dm-cache policies are much more nimble by being able to drive IO as they see fit. One immediate benefit from the improved latency is a cache that should be much more adaptive to changing workloads. - Add a new DM integrity target that emulates a block device that has additional per-sector tags that can be used for storing integrity information. - Add a new authenticated encryption feature to the DM crypt target that builds on the capabilities provided by the DM integrity target. - Add MD interface for switching the raid4/5/6 journal mode and update the DM raid target to use it to enable aid4/5/6 journal write-back support. - Switch the DM verity target over to using the asynchronous hash crypto API (this helps work better with architectures that have access to off-CPU algorithm providers, which should reduce CPU utilization). - Various request-based DM and DM multipath fixes and improvements from Bart and Christoph. - A DM thinp target fix for a bio structure leak that occurs for each discard IFF discard passdown is enabled. - A fix for a possible deadlock in DM bufio and a fix to re-check the new buffer allocation watermark in the face of competing admin changes to the 'max_cache_size_bytes' tunable. - A couple DM core cleanups. Adrian Salido (1): dm ioctl: prevent stack leak in dm ioctl call Andy Shevchenko (1): dm crypt: replace custom implementation of hex2bin() Bart Van Assche (12): dm mpath: requeue after a small delay if blk_get_request() fails dm mpath: split and rename activate_path() to prepare for its expanded use dm mpath: avoid that path removal can trigger an infinite loop dm mpath: delay requeuing while path initialization is in progress dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue() dm block manager: remove an unused argument from dm_block_manager_create() dm: verify suspend_locking assumptions at runtime dm mpath: verify __pg_init_all_paths locking assumptions at runtime dm: introduce enum dm_queue_mode to cleanup related code dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit() dm mpath: make it easier to detect unintended I/O request flushes Dennis Yang (1): dm thin: fix a memory leak when passing discard bio down Eric Biggers (1): dm crypt: remove obsolete references to per-CPU state Gilad Ben-Yossef (1): dm verity: switch to using asynchronous hash crypto API Heinz Mauelshagen (3): md: add raid4/5/6 journal mode switching API dm raid: fix table line argument order in status dm raid: add raid4/5/6 journal write-back support via journal_mode option Joe Thornber (4): dm bio prison v2: new interface for the bio prison dm cache: significant rework to leverage dm-bio-prison-v2 dm cache: set/clear the cache core's dirty_bitset when loading mappings dm cache policy smq: make the cleaner policy write-back more aggressively Matthias Kaehlcke (1): dm ioctl: remove double parentheses Mike Snitzer (1): dm integrity: factor out create_journal() from dm_integrity_ctr() Mikulas Patocka (15): dm bufio: add sector start offset to dm-bufio interface dm: add integrity target dm integrity: add recovery mode dm crypt: use shifts instead of sector_div dm raid: select the Kconfig option CONFIG_MD_RAID0 dm table: replace while loops with for loops dm: mark targets that pass integrity data dm integrity: various small changes and cleanups dm integrity: support larger block sizes dm crypt: fix large block integrity support dm: remove dummy dm_table definition dm integrity: use hex2bin instead of open-coded variant dm integrity: use previously calculated log2 of sectors_per_block dm bufio: avoid a possible ABBA deadlock dm bufio: check new buffer allocation watermark every 30 seconds Milan Br
[dm-devel] [PATCH 0/7] Fix fallout from changes to FUA and PREFLUSH definitions
Hello, this series addresses a performance issue caused by commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous". We know for certain this problem significanly regresses (over 10%, in some cases up to 100%) ext4 and btrfs for dbench4 and reaim benchmarks. Based on this I have fixed up also other places which suffer from the same problem however those changes are untested so maintainers please have a look whether the change makes sense to you and also whether I possibly didn't miss some cases where REQ_SYNC should be also added. Patches in this series are completely independent so if maintainers agree with the change, feel free to take it through your tree. The core of the problem is that above mentioned commit removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions. A side note for ext4: The two patches for ext4 & jbd2 are on top of the change that got merged in the ext4 tree already. Honza -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH 7/7] md: Make flush bios explicitely sync
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-r...@vger.kernel.org CC: Shaohua Li CC: Mike Snitzer CC: dm-devel@redhat.com Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 Signed-off-by: Jan Kara --- drivers/md/dm-snap-persistent.c | 3 ++- drivers/md/md.c | 2 +- drivers/md/raid5-cache.c| 4 ++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c index b93476c3ba3f..b92ab4cb0710 100644 --- a/drivers/md/dm-snap-persistent.c +++ b/drivers/md/dm-snap-persistent.c @@ -741,7 +741,8 @@ static void persistent_commit_exception(struct dm_exception_store *store, /* * Commit exceptions to disk. */ - if (ps->valid && area_io(ps, REQ_OP_WRITE, REQ_PREFLUSH | REQ_FUA)) + if (ps->valid && area_io(ps, REQ_OP_WRITE, +REQ_SYNC | REQ_PREFLUSH | REQ_FUA)) ps->valid = 0; /* diff --git a/drivers/md/md.c b/drivers/md/md.c index f6ae1d67bcd0..9c40ce0642c2 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -753,7 +753,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev, test_bit(FailFast, &rdev->flags) && !test_bit(LastDev, &rdev->flags)) ff = MD_FAILFAST; - bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA | ff; + bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA | ff; atomic_inc(&mddev->pending_writes); submit_bio(bio); diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index 3f307be01b10..0bd5d0c88cee 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -1625,7 +1625,7 @@ static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos, mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum, mb, PAGE_SIZE)); if (!sync_page_io(log->rdev, pos, PAGE_SIZE, page, REQ_OP_WRITE, - REQ_FUA, false)) { + REQ_SYNC | REQ_FUA, false)) { __free_page(page); return -EIO; } @@ -2198,7 +2198,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log, mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum, mb, PAGE_SIZE)); sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, -REQ_OP_WRITE, REQ_FUA, false); +REQ_OP_WRITE, REQ_SYNC | REQ_FUA, false); sh->log_start = ctx->pos; list_add_tail(&sh->r5c, &log->stripe_in_journal_list); atomic_inc(&log->stripe_in_journal_count); -- 2.12.0 -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
On 05/01/2017 06:35 AM, NeilBrown wrote: > On Fri, Apr 28 2017, Peter Rajnoha wrote: > >> On 04/28/2017 05:55 AM, NeilBrown wrote: >>> On Wed, Apr 26 2017, Peter Rajnoha wrote: >>> On 04/20/2017 11:35 PM, NeilBrown wrote: > If we wanted an more permanent udev rule, we would need to record the > devices that should be ignored in the filesystem somewhere else. > Maybe in /run/mdadm. > e.g. > > KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0" > > Then we could have something like the following (untested) in mdadm. > Does that seem more suitable? > Yes, please, if possible, go for a permanent udev rule surely - this will make it much easier for other foreign tools to hook in properly if needed and it would also be much easier to debug. >>> >>> I'm leaning towards the files-in-/run/mdadm approach too. I'll make a >>> proper patch. >>> But, wouldn't it be better if we could just pass this information ("not initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md driver in kernel could add the variable to the uevent it generates which userspace udev rules could check for easily. This way, we don't need to hassle with creating files in filesystem and the information would be directly a part of the uevent the md kernel driver generates (so directly accessible in udev rules too). Also, possibly adding more variables for other future scenarios if needed. >>> >>> When would we clear the "not initialised yet" flag in the kernel, and >>> how? And would that be enough. >> >> The flag wouldn't be stored in kernel, md kernel driver would just pass >> the flag with the uevent as it received in with ioctl/sysfs request to >> create a new dev. The udev in userspace would handle the state >> transition then from "flagged as not-initialized" state to "initilized" >> by checking the sequence of events as they come. >> >> We should have this sequence I assume: >> >> 1) "add" (creating dev, not usable yet) >> 2) "change" (activated dev, but not initialized yet) >> 3) "synthetic change" (after wiping the dev and closing it, the WATCH >> rule fires) >> > > "Should" is arguable, but there are no guarantees of this sequence. > > A particular cause of irregular sequencing is when an array is assembled > by the initrd, then after the real root is mounted, something runs > udevadm trigger --action=add > (e.g. systemd-udev-triggger.service). > > In that case, 'add' is the first and only message that the > full-root-available udev sees, so it is not safe to ignore the array as > "not usable yet". > > As for initrd case, there's OPTIONS+="db_persist" rule which is already used by dracut to keep udev db records when switching from initrd to root fs and replacing the udev daemon instance (other devices which do not have this option set will have udev db records cleared when switched to root fs). Not sure why it's not documented in udev yet (I already asked for that years ago), but the support is definitely there. So there's already a way to keep these records and to not lose track of the current state when switching from initrd. This makes it still possible to detect whether the synthetic "add" (or "change") event is sent before or after proper device initialization and to make it possible to react on the trigger (with synthetic uevents generated) still while ignoring the "add" event if that comes from the kernel before the device is initialized. And even if we're switching from initrd to root fs and restarting udev daemon. >>> >>> When mdadm creates an md array, at least 3 uevents get generated. >>> The first is generated when the md device object is created, either by >>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs. >>> The second is generated when the array is started and the content is >>> visible. >>> The third is generated when mdadm closes the file descriptor. It opens >>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it. >>> Because udev uses inotify to watch for a 'close for a writable file >>> descriptor', this generates another event. >>> >>> We need to ensure that none of these cause udev to run anything that >>> inspects the contents of the array. >>> Of the three, only the second one is directly under the control of the >>> md module, so only that one can add an environment variable. >>> >>> It might be possible to avoid the last one (by not opening for write), >>> but I cannot see a way to avoid the first one. >> >> So the first event is the "add" event ("md device object created") - in >> this case, the device is not yet usable anyway I suppose, so we can skip >> udev scans for this event right away (unless it's the synthetic "add" >> event which is generated by writing "add" to /sys/block/.../uevent file >> or alternatively using udevadm trigger - but we should be able to >> recognize this event "synthetic add even
Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag
On Mon, 2017-05-01 at 23:43 -0700, Nicholas A. Bellinger wrote: > On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote: > > On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote: > > > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can > > > kill this hack. > > > > > > Signed-off-by: Christoph Hellwig > > > Reviewed-by: Martin K. Petersen > > > Reviewed-by: Hannes Reinecke > > > [ ... ] > > > diff --git a/drivers/target/target_core_device.c > > > b/drivers/target/target_core_device.c > > > index c754ae33bf7b..d2f089cfa9ae 100644 > > > --- a/drivers/target/target_core_device.c > > > +++ b/drivers/target/target_core_device.c > > > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct > > > se_dev_attrib *attrib, > > > attrib->unmap_granularity = q->limits.discard_granularity / block_size; > > > attrib->unmap_granularity_alignment = q->limits.discard_alignment / > > > block_size; > > > - attrib->unmap_zeroes_data = q->limits.discard_zeroes_data; > > > + attrib->unmap_zeroes_data = 0; > > > return true; > > > } > > > EXPORT_SYMBOL(target_configure_unmap_from_queue); > > > > Hello Christoph, > > > > Sorry that I hadn't noticed this before but I think that this patch > > introduces a significant performance regressions for LIO users. Before > > this patch the LBPRZ flag was reported correctly to initiator systems > > through the thin provisioning VPD. With this patch applied that flag > > will always be reported as zero, forcing initiators to submit WRITE > > commands with zeroed data buffers instead of submitting the SCSI UNMAP > > command to block devices for which discard_zeroes_data was set. From > > target_core_spc.c: > > > > /* Thin Provisioning VPD */ > > static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char > > *buf) > > { > > [ ... ] > > /* > > * The unmap_zeroes_data set means that the underlying device supports > > * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies > > * the SBC requirements for LBPRZ, meaning that a subsequent read > > * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA > > * See sbc4r36 6.6.4. > > */ > > if (((dev->dev_attrib.emulate_tpu != 0) || > > (dev->dev_attrib.emulate_tpws != 0)) && > > (dev->dev_attrib.unmap_zeroes_data != 0)) > > buf[5] |= 0x04; > > [ ... ] > > } > > > > According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and > SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with. > For UNMAP, q->limits.discard_zeroes_data was never set. > > That said, it's pretty much implied that supporting DISCARD means > subsequent READs return zeros, so target_configure_unmap_from_queue() > should be setting attrib->unmap_zeroes_data = 1, or just dropping it > all-together. > > Post -rc1, I'll push a patch to do the latter. > Or, another options is use bdev_write_zeroes_sectors() to determine when dev_attrib->unmap_zeroes_data should be set. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function
On 28/04/17 11:51 AM, Herbert Xu wrote: > On Fri, Apr 28, 2017 at 10:53:45AM -0600, Logan Gunthorpe wrote: >> >> >> On 28/04/17 12:30 AM, Herbert Xu wrote: >>> You are right. Indeed the existing code looks buggy as they >>> don't take sg->offset into account when doing the kmap. Could >>> you send me some patches that fix these problems first so that >>> they can be easily backported? >> >> Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam >> both do have the sg->offset accounted for. I'll send a patch for the >> buggy one shortly. > > I think they're all buggy when sg->offset is greater than PAGE_SIZE. Yes, technically. But that's a _very_ common mistake. Pretty nearly every case I looked at did not take that into account. I don't think sg's that point to more than one continuous page are all that common. Fixing all those cases without making a common function is a waste of time IMO. Logan -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag
On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote: > On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote: > > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can > > kill this hack. > > > > Signed-off-by: Christoph Hellwig > > Reviewed-by: Martin K. Petersen > > Reviewed-by: Hannes Reinecke > > [ ... ] > > diff --git a/drivers/target/target_core_device.c > > b/drivers/target/target_core_device.c > > index c754ae33bf7b..d2f089cfa9ae 100644 > > --- a/drivers/target/target_core_device.c > > +++ b/drivers/target/target_core_device.c > > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct > > se_dev_attrib *attrib, > > attrib->unmap_granularity = q->limits.discard_granularity / block_size; > > attrib->unmap_granularity_alignment = q->limits.discard_alignment / > > block_size; > > - attrib->unmap_zeroes_data = q->limits.discard_zeroes_data; > > + attrib->unmap_zeroes_data = 0; > > return true; > > } > > EXPORT_SYMBOL(target_configure_unmap_from_queue); > > Hello Christoph, > > Sorry that I hadn't noticed this before but I think that this patch > introduces a significant performance regressions for LIO users. Before > this patch the LBPRZ flag was reported correctly to initiator systems > through the thin provisioning VPD. With this patch applied that flag > will always be reported as zero, forcing initiators to submit WRITE > commands with zeroed data buffers instead of submitting the SCSI UNMAP > command to block devices for which discard_zeroes_data was set. From > target_core_spc.c: > > /* Thin Provisioning VPD */ > static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char > *buf) > { > [ ... ] > /* >* The unmap_zeroes_data set means that the underlying device supports >* REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies >* the SBC requirements for LBPRZ, meaning that a subsequent read >* will return zeroes after an UNMAP or WRITE SAME (16) to an LBA >* See sbc4r36 6.6.4. >*/ > if (((dev->dev_attrib.emulate_tpu != 0) || >(dev->dev_attrib.emulate_tpws != 0)) && >(dev->dev_attrib.unmap_zeroes_data != 0)) > buf[5] |= 0x04; > [ ... ] > } > According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with. For UNMAP, q->limits.discard_zeroes_data was never set. That said, it's pretty much implied that supporting DISCARD means subsequent READs return zeros, so target_configure_unmap_from_queue() should be setting attrib->unmap_zeroes_data = 1, or just dropping it all-together. Post -rc1, I'll push a patch to do the latter. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function
On 28/04/17 12:30 AM, Herbert Xu wrote: > You are right. Indeed the existing code looks buggy as they > don't take sg->offset into account when doing the kmap. Could > you send me some patches that fix these problems first so that > they can be easily backported? Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam both do have the sg->offset accounted for. I'll send a patch for the buggy one shortly. Logan -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag
On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote: > Or, another options is use bdev_write_zeroes_sectors() to determine when > dev_attrib->unmap_zeroes_data should be set. Yes, that in combination with your patch to use bdev_write_zeroes_sectors for zeroing from write same seems like the right fix. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel