Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-02 Thread Jes Sorensen

On 04/28/2017 01:05 AM, NeilBrown wrote:


When an array is created the content is not initialized,
so it could have remnants of an old filesystem or md array
etc on it.
udev will see this and might try to activate it, which is almost
certainly not what is wanted.

So create a mechanism for mdadm to communicate with udev to tell
it that the device isn't ready.  This mechanism is the existance
of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.

When creating an array, mdadm will create the file.
A new udev rule file, 01-md-raid-creating.rules, will detect the
precense of thst file and set ENV{SYSTEMD_READY}="0".
This is fairly uniformly used to suppress actions based on the
contents of the device.

Signed-off-by: NeilBrown 
---
 Assemble.c  |  2 +-
 Build.c |  2 +-
 Create.c|  9 +++-
 Incremental.c   |  4 ++--
 Makefile|  4 ++--
 lib.c   | 29 +
 mdadm.h |  4 +++-
 mdopen.c| 52 -
 udev-md-raid-creating.rules |  7 ++
 9 files changed, 86 insertions(+), 27 deletions(-)
 create mode 100644 udev-md-raid-creating.rules


Applied!

I like this solution much better, even though I am not in love with the 
giant usleep() call. Would be nice to find a better way around that.


Sorry it took so long to get back to you on this, last week was a mess.

Thanks,
Jes


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Tue, 2017-05-02 at 09:23 +0200, h...@lst.de wrote:
> On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > Or, another options is use bdev_write_zeroes_sectors() to determine when
> > dev_attrib->unmap_zeroes_data should be set.
> 
> Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> for zeroing from write same seems like the right fix.

The larger target/iblock conversion patch looks like post v4.12 material
at this point, so to avoid breakage wrt to existing LBPRZ behavior, I'll
plan to push the following patch post -rc1.

diff --git a/drivers/target/target_core_device.c 
b/drivers/target/target_core_device.c
index d2f089c..e7caf78 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib 
*attrib,
attrib->unmap_granularity = q->limits.discard_granularity / block_size;
attrib->unmap_granularity_alignment = q->limits.discard_alignment /
block_size;
-   attrib->unmap_zeroes_data = 0;
+   attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
return true;
 }
 EXPORT_SYMBOL(target_configure_unmap_from_queue);

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

2017-05-02 Thread Jes Sorensen

On 05/02/2017 07:40 AM, Peter Rajnoha wrote:

On 05/01/2017 06:35 AM, NeilBrown wrote:

On Fri, Apr 28 2017, Peter Rajnoha wrote:

Then mdadm opens the devive, clears any old content/signatures the data
area may contain, then closes it - this generates the third event -
which is the "synthetic change" event (as a result of the inotify WATCH
rule). And this one would drop the "not initialized" flag in udev db and
the scans in udev are now enabled.


Nope, it would be incorrect for mdadm to clear any old content.
Sometimes people want to ignore old content.  Sometimes they very
definitely want to use it.  It would be wrong for any code to try to
guess what is wanted.


The mdadm is not going to guess - it can have an option to
enable/disable the wiping on demand directly on command line (which is
also what is actually done in LVM).


I know the anaconda team keeps pushing for the nonsense of being able to 
wipe drives on creation. It is wrong, it is broken, and it is not going 
to happen.



Otherwise, if mdadm is not going to wipe/initialize the device itself,
then it needs to wait for the external tool to do it (based on external
configuration or on some manual wipefs-like call). So the sequence would be:

 1) mdadm creating the device
 2) mdadm setting up the device, marking it as "not initialized yet"
 4a) mdadm waiting for the external decision to be made about wiping part
 4b) external tool doing the wiping (or not) based on configuration or
user's will
 5) mdadm continuing and finishing when the wiping part is complete

I expect mdadm to return only if the device is properly initialized
otherwise it's harder for other tools called after mdadm to start
working with the device - they need to poll for the state laboriously
and check for readiness.


What defines readiness? Some believe a raid1 array must be fully 
assembled with all members, other setups are happy to have one running 
drive in place.


4a/4b in your list here once again has no place in mdadm. Please kindly 
tell the anaconda team to go back and handle this properly instead.


Jes

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-02 Thread Jes Sorensen

On 04/28/2017 05:28 AM, Peter Rajnoha wrote:

On 04/28/2017 07:05 AM, NeilBrown wrote:


When an array is created the content is not initialized,
so it could have remnants of an old filesystem or md array
etc on it.
udev will see this and might try to activate it, which is almost
certainly not what is wanted.

So create a mechanism for mdadm to communicate with udev to tell
it that the device isn't ready.  This mechanism is the existance
of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.

When creating an array, mdadm will create the file.
A new udev rule file, 01-md-raid-creating.rules, will detect the
precense of thst file and set ENV{SYSTEMD_READY}="0".
This is fairly uniformly used to suppress actions based on the
contents of the device.


The scans in udev are primarily directed by blkid call which detects the
signatures and based on this information various other udev rules fire.

The blkid as well as wipefs uses common libblkid library to detect these
signatures - is mdadm going to use libblkid to wipe the signatures on MD
device initialization or is it relying on external tools to do this? How
is mdadm actually initializing/wiping the newly created MD device?


mdadm doesn't wipe data and it isn't supposed to. Being able to create 
an array from drives with existing data is a feature.


It is the responsibility of the system administrator to handle drives 
with existing data, in the same way the administrator is expected to 
handle insertion of USB drives or external drives being powered on.


Jes

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 10/10] dm-zoned: Drive-managed zoned block device target

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 02:53 +0900, damien.lem...@wdc.com wrote:
> +static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
> +  struct shrink_control *sc)
> +{
> +   struct dmz_target *dmz =
> +   container_of(shrink, struct dmz_target, mblk_shrinker);
> +
> +   return atomic_read(&dmz->nr_mblks);
> +}

Hello Damien,

dmz_mblock_shrinker_count() probably should return the following value since
dmz_shrink_mblock_cache() won't free more than the this number of elements:

max(atomic_read(&dmz->nr_mblks) - dmz->min_nr_mblks, 0) 

But since v2 is IMHO good enough to be merged, for the whole series:

Reviewed-by: Bart Van Assche 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

2017-05-02 Thread NeilBrown

I'm sorry, but I didn't read all your words.

You seemed to be telling me about extra complexity in udev, and extra
complexity that you think belongs in mdadm, which together might achieve
your vision for how things should work.

But to me, complexity is the enemy.  Give me "simple" any day.

NeilBrown


On Tue, May 02 2017, Peter Rajnoha wrote:

> On 05/01/2017 06:35 AM, NeilBrown wrote:
>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>> 
>>> On 04/28/2017 05:55 AM, NeilBrown wrote:
 On Wed, Apr 26 2017, Peter Rajnoha wrote:

> On 04/20/2017 11:35 PM, NeilBrown wrote:
>> If we wanted an more permanent udev rule, we would need to record the
>> devices that should be ignored in the filesystem somewhere else.
>> Maybe in /run/mdadm.
>> e.g.
>>
>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", 
>> ENV{SYSTEMD_READY}="0"
>>
>> Then we could have something like the following (untested) in mdadm.
>> Does that seem more suitable?
>>
>
> Yes, please, if possible, go for a permanent udev rule surely - this
> will make it much easier for other foreign tools to hook in properly if
> needed and it would also be much easier to debug.

 I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
 proper patch.

>
> But, wouldn't it be better if we could just pass this information ("not
> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
> driver in kernel could add the variable to the uevent it generates which
> userspace udev rules could check for easily. This way, we don't need to
> hassle with creating files in filesystem and the information would be
> directly a part of the uevent the md kernel driver generates (so
> directly accessible in udev rules too). Also, possibly adding more
> variables for other future scenarios if needed.

 When would we clear the "not initialised yet" flag in the kernel, and
 how?  And would that be enough.
>>>
>>> The flag wouldn't be stored in kernel, md kernel driver would just pass
>>> the flag with the uevent as it received in with ioctl/sysfs request to
>>> create a new dev. The udev in userspace would handle the state
>>> transition then from "flagged as not-initialized" state to "initilized"
>>> by checking the sequence of events as they come.
>>>
>>> We should have this sequence I assume:
>>>
>>>   1) "add" (creating dev, not usable yet)
>>>   2) "change" (activated dev, but not initialized yet)
>>>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
>>> rule fires)
>>>
>> 
>> "Should" is arguable, but there are no guarantees of this sequence.
>> 
>> A particular cause of irregular sequencing is when an array is assembled
>> by the initrd, then after the real root is mounted, something runs
>>   udevadm trigger --action=add
>> (e.g. systemd-udev-triggger.service).
>> 
>> In that case, 'add' is the first and only message that the
>> full-root-available udev sees, so it is not safe to ignore the array as
>> "not usable yet".
>> 
>> 
>
> As for initrd case, there's OPTIONS+="db_persist" rule which is already
> used by dracut to keep udev db records when switching from initrd to
> root fs and replacing the udev daemon instance (other devices which do
> not have this option set will have udev db records cleared when switched
> to root fs). Not sure why it's not documented in udev yet (I already
> asked for that years ago), but the support is definitely there. So
> there's already a way to keep these records and to not lose track of the
> current state when switching from initrd. This makes it still possible
> to detect whether the synthetic "add" (or "change") event is sent before
> or after proper device initialization and to make it possible to react
> on the trigger (with synthetic uevents generated) still while ignoring
> the "add" event if that comes from the kernel before the device is
> initialized. And even if we're switching from initrd to root fs and
> restarting udev daemon.
>

 When mdadm creates an md array, at least 3 uevents get generated.
 The first is generated when the md device object is created, either by
 opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
 The second is generated when the array is started and the content is
 visible.
 The third is generated when mdadm closes the file descriptor.  It opens
 /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
 Because udev uses inotify to watch for a 'close for a writable file
 descriptor', this generates another event.

 We need to ensure that none of these cause udev to run anything that
 inspects the contents of the array.
 Of the three, only the second one is directly under the control of the
 md module, so only that one can add an environment variable.

 It might be possible to avoid the last one (by not opening for wri

[dm-devel] [PATCH] dm-bufio: make the parameter "retain_bytes" unsigned long

2017-05-02 Thread Mikulas Patocka
Change the type of the parameter "retain_bytes" from unsigned to unsigned
long, so that on 64-bit machines the user can set more than 4GiB of data
to be retained.

Also, change the type of the variable "count" in the function
"__evict_old_buffers" to unsigned long. The assignment
"count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
could result in unsigned long to unsigned overflow and that could result
in buffers not being freed when they should.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 drivers/md/dm-bufio.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/md/dm-bufio.c
===
--- linux-2.6.orig/drivers/md/dm-bufio.c
+++ linux-2.6/drivers/md/dm-bufio.c
@@ -228,7 +228,7 @@ static DEFINE_SPINLOCK(param_spinlock);
  * Buffers are freed after this timeout
  */
 static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;
-static unsigned dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;
+static unsigned long dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;
 
 static unsigned long dm_bufio_peak_allocated;
 static unsigned long dm_bufio_allocated_kmem_cache;
@@ -1597,9 +1597,9 @@ static bool __try_evict_buffer(struct dm
return true;
 }
 
-static unsigned get_retain_buffers(struct dm_bufio_client *c)
+static unsigned long get_retain_buffers(struct dm_bufio_client *c)
 {
-unsigned retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes);
+unsigned long retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes);
 return retain_bytes >> (c->sectors_per_block_bits + SECTOR_SHIFT);
 }
 
@@ -1610,7 +1610,7 @@ static unsigned long __scan(struct dm_bu
struct dm_buffer *b, *tmp;
unsigned long freed = 0;
unsigned long count = nr_to_scan;
-   unsigned retain_target = get_retain_buffers(c);
+   unsigned long retain_target = get_retain_buffers(c);
 
for (l = 0; l < LIST_SIZE; l++) {
list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
@@ -1833,8 +1833,8 @@ static bool older_than(struct dm_buffer
 static void __evict_old_buffers(struct dm_bufio_client *c, unsigned long 
age_hz)
 {
struct dm_buffer *b, *tmp;
-   unsigned retain_target = get_retain_buffers(c);
-   unsigned count;
+   unsigned long retain_target = get_retain_buffers(c);
+   unsigned long count;
LIST_HEAD(write_list);
 
dm_bufio_lock(c);
@@ -1994,7 +1994,7 @@ MODULE_PARM_DESC(max_cache_size_bytes, "
 module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds");
 
-module_param_named(retain_bytes, dm_bufio_retain_bytes, uint, S_IRUGO | 
S_IWUSR);
+module_param_named(retain_bytes, dm_bufio_retain_bytes, ulong, S_IRUGO | 
S_IWUSR);
 MODULE_PARM_DESC(retain_bytes, "Try to keep at least this many bytes cached in 
memory");
 
 module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, 
S_IRUGO | S_IWUSR);

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [git pull] device mapper post-merge changes for 4.12

2017-05-02 Thread Mike Snitzer
Hi Linus,

Here are some changes from Christoph that needed to be rebased ontop of
changes that were already merged into the 'tags/for-4.12/dm-changes'.
In addition, these changes depend on the 'for-4.12/block' changes that
you've already merged.  So once you've pulled 'tags/for-4.12/dm-changes'
please then consider pulling these changes.

The following changes since commit 7e25a7606147bfe29a7421ff2cb332b07d3cee3a:

  Merge branch 'dm-4.12' into dm-4.12-post-merge (2017-05-01 18:18:04 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-4.12/dm-post-merge-changes

for you to fetch changes up to 412445acb6cad4cef026daae37c4765fb9942c60:

  dm: introduce a new DM_MAPIO_KILL return value (2017-05-01 18:19:03 -0400)

Please pull, thanks.
Mike


- Cleanups to request-based DM and DM multipath from Christoph that
  prepare for his block core error code type checking improvements.


Christoph Hellwig (3):
  dm mpath: merge do_end_io into multipath_end_io
  dm rq: change ->rq_end_io calling conventions
  dm: introduce a new DM_MAPIO_KILL return value

 drivers/md/dm-mpath.c | 54 +--
 drivers/md/dm-rq.c| 29 ---
 drivers/md/dm-target.c|  2 +-
 include/linux/device-mapper.h |  2 ++
 4 files changed, 39 insertions(+), 48 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [git pull] device mapper changes for 4.12

2017-05-02 Thread Mike Snitzer
Hi Linus,

The following changes since commit c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201:

  Linux 4.11-rc1 (2017-03-05 12:59:56 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-4.12/dm-changes

for you to fetch changes up to 390020ad2af9ca04844c4f3b1f299ad8746d84c8:

  dm bufio: check new buffer allocation watermark every 30 seconds (2017-05-01 
15:21:42 -0400)

Please pull, thanks.
Mike


- A major update for DM cache that reduces the latency for deciding
  whether blocks should migrate to/from the cache.  The bio-prison-v2
  interface supports this improvement by enabling direct dispatch of
  work to workqueues rather than having to delay the actual work
  dispatch to the DM cache core.  So the dm-cache policies are much more
  nimble by being able to drive IO as they see fit.  One immediate
  benefit from the improved latency is a cache that should be much more
  adaptive to changing workloads.

- Add a new DM integrity target that emulates a block device that has
  additional per-sector tags that can be used for storing integrity
  information.

- Add a new authenticated encryption feature to the DM crypt target that
  builds on the capabilities provided by the DM integrity target.

- Add MD interface for switching the raid4/5/6 journal mode and update
  the DM raid target to use it to enable aid4/5/6 journal write-back
  support.

- Switch the DM verity target over to using the asynchronous hash crypto
  API (this helps work better with architectures that have access to
  off-CPU algorithm providers, which should reduce CPU utilization).

- Various request-based DM and DM multipath fixes and improvements from
  Bart and Christoph.

- A DM thinp target fix for a bio structure leak that occurs for each
  discard IFF discard passdown is enabled.

- A fix for a possible deadlock in DM bufio and a fix to re-check the
  new buffer allocation watermark in the face of competing admin changes
  to the 'max_cache_size_bytes' tunable.

- A couple DM core cleanups.


Adrian Salido (1):
  dm ioctl: prevent stack leak in dm ioctl call

Andy Shevchenko (1):
  dm crypt: replace custom implementation of hex2bin()

Bart Van Assche (12):
  dm mpath: requeue after a small delay if blk_get_request() fails
  dm mpath: split and rename activate_path() to prepare for its expanded use
  dm mpath: avoid that path removal can trigger an infinite loop
  dm mpath: delay requeuing while path initialization is in progress
  dm rq: check blk_mq_register_dev() return value in 
dm_mq_init_request_queue()
  dm block manager: remove an unused argument from dm_block_manager_create()
  dm: verify suspend_locking assumptions at runtime
  dm mpath: verify __pg_init_all_paths locking assumptions at runtime
  dm: introduce enum dm_queue_mode to cleanup related code
  dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
  dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing 
assign_bit()
  dm mpath: make it easier to detect unintended I/O request flushes

Dennis Yang (1):
  dm thin: fix a memory leak when passing discard bio down

Eric Biggers (1):
  dm crypt: remove obsolete references to per-CPU state

Gilad Ben-Yossef (1):
  dm verity: switch to using asynchronous hash crypto API

Heinz Mauelshagen (3):
  md: add raid4/5/6 journal mode switching API
  dm raid: fix table line argument order in status
  dm raid: add raid4/5/6 journal write-back support via journal_mode option

Joe Thornber (4):
  dm bio prison v2: new interface for the bio prison
  dm cache: significant rework to leverage dm-bio-prison-v2
  dm cache: set/clear the cache core's dirty_bitset when loading mappings
  dm cache policy smq: make the cleaner policy write-back more aggressively

Matthias Kaehlcke (1):
  dm ioctl: remove double parentheses

Mike Snitzer (1):
  dm integrity: factor out create_journal() from dm_integrity_ctr()

Mikulas Patocka (15):
  dm bufio: add sector start offset to dm-bufio interface
  dm: add integrity target
  dm integrity: add recovery mode
  dm crypt: use shifts instead of sector_div
  dm raid: select the Kconfig option CONFIG_MD_RAID0
  dm table: replace while loops with for loops
  dm: mark targets that pass integrity data
  dm integrity: various small changes and cleanups
  dm integrity: support larger block sizes
  dm crypt: fix large block integrity support
  dm: remove dummy dm_table definition
  dm integrity: use hex2bin instead of open-coded variant
  dm integrity: use previously calculated log2 of sectors_per_block
  dm bufio: avoid a possible ABBA deadlock
  dm bufio: check new buffer allocation watermark every 30 seconds

Milan Br

[dm-devel] [PATCH 0/7] Fix fallout from changes to FUA and PREFLUSH definitions

2017-05-02 Thread Jan Kara
Hello,

this series addresses a performance issue caused by commit b685d3d65ac7 "block:
treat REQ_FUA and REQ_PREFLUSH as synchronous". We know for certain this
problem significanly regresses (over 10%, in some cases up to 100%) ext4 and
btrfs for dbench4 and reaim benchmarks.  Based on this I have fixed up also
other places which suffer from the same problem however those changes are
untested so maintainers please have a look whether the change makes sense to
you and also whether I possibly didn't miss some cases where REQ_SYNC should be
also added. Patches in this series are completely independent so if maintainers
agree with the change, feel free to take it through your tree.

The core of the problem is that above mentioned commit removed REQ_SYNC flag
from WRITE_{FUA|PREFLUSH|...} definitions.  generic_make_request_checks()
however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage
doesn't report volatile write cache and thus write effectively becomes
asynchronous which can lead to performance regressions.

A side note for ext4: The two patches for ext4 & jbd2 are on top of the change
that got merged in the ext4 tree already.

Honza

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 7/7] md: Make flush bios explicitely sync

2017-05-02 Thread Jan Kara
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: linux-r...@vger.kernel.org
CC: Shaohua Li 
CC: Mike Snitzer 
CC: dm-devel@redhat.com
Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
Signed-off-by: Jan Kara 
---
 drivers/md/dm-snap-persistent.c | 3 ++-
 drivers/md/md.c | 2 +-
 drivers/md/raid5-cache.c| 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c
index b93476c3ba3f..b92ab4cb0710 100644
--- a/drivers/md/dm-snap-persistent.c
+++ b/drivers/md/dm-snap-persistent.c
@@ -741,7 +741,8 @@ static void persistent_commit_exception(struct 
dm_exception_store *store,
/*
 * Commit exceptions to disk.
 */
-   if (ps->valid && area_io(ps, REQ_OP_WRITE, REQ_PREFLUSH | REQ_FUA))
+   if (ps->valid && area_io(ps, REQ_OP_WRITE,
+REQ_SYNC | REQ_PREFLUSH | REQ_FUA))
ps->valid = 0;
 
/*
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f6ae1d67bcd0..9c40ce0642c2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -753,7 +753,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev 
*rdev,
test_bit(FailFast, &rdev->flags) &&
!test_bit(LastDev, &rdev->flags))
ff = MD_FAILFAST;
-   bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA | ff;
+   bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA | ff;
 
atomic_inc(&mddev->pending_writes);
submit_bio(bio);
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..0bd5d0c88cee 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1625,7 +1625,7 @@ static int r5l_log_write_empty_meta_block(struct r5l_log 
*log, sector_t pos,
mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum,
 mb, PAGE_SIZE));
if (!sync_page_io(log->rdev, pos, PAGE_SIZE, page, REQ_OP_WRITE,
- REQ_FUA, false)) {
+ REQ_SYNC | REQ_FUA, false)) {
__free_page(page);
return -EIO;
}
@@ -2198,7 +2198,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log 
*log,
mb->checksum = cpu_to_le32(crc32c_le(log->uuid_checksum,
 mb, PAGE_SIZE));
sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
-REQ_OP_WRITE, REQ_FUA, false);
+REQ_OP_WRITE, REQ_SYNC | REQ_FUA, false);
sh->log_start = ctx->pos;
list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
atomic_inc(&log->stripe_in_journal_count);
-- 
2.12.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

2017-05-02 Thread Peter Rajnoha
On 05/01/2017 06:35 AM, NeilBrown wrote:
> On Fri, Apr 28 2017, Peter Rajnoha wrote:
> 
>> On 04/28/2017 05:55 AM, NeilBrown wrote:
>>> On Wed, Apr 26 2017, Peter Rajnoha wrote:
>>>
 On 04/20/2017 11:35 PM, NeilBrown wrote:
> If we wanted an more permanent udev rule, we would need to record the
> devices that should be ignored in the filesystem somewhere else.
> Maybe in /run/mdadm.
> e.g.
>
>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>
> Then we could have something like the following (untested) in mdadm.
> Does that seem more suitable?
>

 Yes, please, if possible, go for a permanent udev rule surely - this
 will make it much easier for other foreign tools to hook in properly if
 needed and it would also be much easier to debug.
>>>
>>> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
>>> proper patch.
>>>

 But, wouldn't it be better if we could just pass this information ("not
 initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
 driver in kernel could add the variable to the uevent it generates which
 userspace udev rules could check for easily. This way, we don't need to
 hassle with creating files in filesystem and the information would be
 directly a part of the uevent the md kernel driver generates (so
 directly accessible in udev rules too). Also, possibly adding more
 variables for other future scenarios if needed.
>>>
>>> When would we clear the "not initialised yet" flag in the kernel, and
>>> how?  And would that be enough.
>>
>> The flag wouldn't be stored in kernel, md kernel driver would just pass
>> the flag with the uevent as it received in with ioctl/sysfs request to
>> create a new dev. The udev in userspace would handle the state
>> transition then from "flagged as not-initialized" state to "initilized"
>> by checking the sequence of events as they come.
>>
>> We should have this sequence I assume:
>>
>>   1) "add" (creating dev, not usable yet)
>>   2) "change" (activated dev, but not initialized yet)
>>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
>> rule fires)
>>
> 
> "Should" is arguable, but there are no guarantees of this sequence.
> 
> A particular cause of irregular sequencing is when an array is assembled
> by the initrd, then after the real root is mounted, something runs
>   udevadm trigger --action=add
> (e.g. systemd-udev-triggger.service).
> 
> In that case, 'add' is the first and only message that the
> full-root-available udev sees, so it is not safe to ignore the array as
> "not usable yet".
> 
> 

As for initrd case, there's OPTIONS+="db_persist" rule which is already
used by dracut to keep udev db records when switching from initrd to
root fs and replacing the udev daemon instance (other devices which do
not have this option set will have udev db records cleared when switched
to root fs). Not sure why it's not documented in udev yet (I already
asked for that years ago), but the support is definitely there. So
there's already a way to keep these records and to not lose track of the
current state when switching from initrd. This makes it still possible
to detect whether the synthetic "add" (or "change") event is sent before
or after proper device initialization and to make it possible to react
on the trigger (with synthetic uevents generated) still while ignoring
the "add" event if that comes from the kernel before the device is
initialized. And even if we're switching from initrd to root fs and
restarting udev daemon.

>>>
>>> When mdadm creates an md array, at least 3 uevents get generated.
>>> The first is generated when the md device object is created, either by
>>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
>>> The second is generated when the array is started and the content is
>>> visible.
>>> The third is generated when mdadm closes the file descriptor.  It opens
>>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
>>> Because udev uses inotify to watch for a 'close for a writable file
>>> descriptor', this generates another event.
>>>
>>> We need to ensure that none of these cause udev to run anything that
>>> inspects the contents of the array.
>>> Of the three, only the second one is directly under the control of the
>>> md module, so only that one can add an environment variable.
>>>
>>> It might be possible to avoid the last one (by not opening for write),
>>> but I cannot see a way to avoid the first one.
>>
>> So the first event is the "add" event ("md device object created") - in
>> this case, the device is not yet usable anyway I suppose, so we can skip
>> udev scans for this event right away (unless it's the synthetic "add"
>> event which is generated by writing "add" to /sys/block/.../uevent file
>> or alternatively using udevadm trigger - but we should be able to
>> recognize this event "synthetic add even

Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Mon, 2017-05-01 at 23:43 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote:
> > On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > > kill this hack.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > > Reviewed-by: Martin K. Petersen 
> > > Reviewed-by: Hannes Reinecke 
> > > [ ... ]
> > > diff --git a/drivers/target/target_core_device.c 
> > > b/drivers/target/target_core_device.c
> > > index c754ae33bf7b..d2f089cfa9ae 100644
> > > --- a/drivers/target/target_core_device.c
> > > +++ b/drivers/target/target_core_device.c
> > > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> > > se_dev_attrib *attrib,
> > >   attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > >   attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > >   block_size;
> > > - attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > > + attrib->unmap_zeroes_data = 0;
> > >   return true;
> > >  }
> > >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> > 
> > Hello Christoph,
> > 
> > Sorry that I hadn't noticed this before but I think that this patch
> > introduces a significant performance regressions for LIO users. Before
> > this patch the LBPRZ flag was reported correctly to initiator systems
> > through the thin provisioning VPD. With this patch applied that flag
> > will always be reported as zero, forcing initiators to submit WRITE
> > commands with zeroed data buffers instead of submitting the SCSI UNMAP
> > command to block devices for which discard_zeroes_data was set. From
> > target_core_spc.c:
> > 
> > /* Thin Provisioning VPD */
> > static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
> > *buf)
> > {
> > [ ... ]
> > /*
> >  * The unmap_zeroes_data set means that the underlying device supports
> >  * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
> >  * the SBC requirements for LBPRZ, meaning that a subsequent read
> >  * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
> >  * See sbc4r36 6.6.4.
> >  */
> > if (((dev->dev_attrib.emulate_tpu != 0) ||
> >  (dev->dev_attrib.emulate_tpws != 0)) &&
> >  (dev->dev_attrib.unmap_zeroes_data != 0))
> > buf[5] |= 0x04;
> > [ ... ]
> > }
> > 
> 
> According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
> SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with.
> For UNMAP, q->limits.discard_zeroes_data was never set.
> 
> That said, it's pretty much implied that supporting DISCARD means
> subsequent READs return zeros, so target_configure_unmap_from_queue()
> should be setting attrib->unmap_zeroes_data = 1, or just dropping it
> all-together.
> 
> Post -rc1, I'll push a patch to do the latter.
> 

Or, another options is use bdev_write_zeroes_sectors() to determine when
dev_attrib->unmap_zeroes_data should be set.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function

2017-05-02 Thread Logan Gunthorpe


On 28/04/17 11:51 AM, Herbert Xu wrote:
> On Fri, Apr 28, 2017 at 10:53:45AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 28/04/17 12:30 AM, Herbert Xu wrote:
>>> You are right.  Indeed the existing code looks buggy as they
>>> don't take sg->offset into account when doing the kmap.  Could
>>> you send me some patches that fix these problems first so that
>>> they can be easily backported?
>>
>> Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam
>> both do have the sg->offset accounted for. I'll send a patch for the
>> buggy one shortly.
> 
> I think they're all buggy when sg->offset is greater than PAGE_SIZE.

Yes, technically. But that's a _very_ common mistake. Pretty nearly
every case I looked at did not take that into account. I don't think
sg's that point to more than one continuous page are all that common.

Fixing all those cases without making a common function is a waste of
time IMO.

Logan

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > kill this hack.
> > 
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Martin K. Petersen 
> > Reviewed-by: Hannes Reinecke 
> > [ ... ]
> > diff --git a/drivers/target/target_core_device.c 
> > b/drivers/target/target_core_device.c
> > index c754ae33bf7b..d2f089cfa9ae 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> > se_dev_attrib *attrib,
> > attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > block_size;
> > -   attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > +   attrib->unmap_zeroes_data = 0;
> > return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but I think that this patch
> introduces a significant performance regressions for LIO users. Before
> this patch the LBPRZ flag was reported correctly to initiator systems
> through the thin provisioning VPD. With this patch applied that flag
> will always be reported as zero, forcing initiators to submit WRITE
> commands with zeroed data buffers instead of submitting the SCSI UNMAP
> command to block devices for which discard_zeroes_data was set. From
> target_core_spc.c:
> 
> /* Thin Provisioning VPD */
> static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
> *buf)
> {
>   [ ... ]
>   /*
>* The unmap_zeroes_data set means that the underlying device supports
>* REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
>* the SBC requirements for LBPRZ, meaning that a subsequent read
>* will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
>* See sbc4r36 6.6.4.
>*/
>   if (((dev->dev_attrib.emulate_tpu != 0) ||
>(dev->dev_attrib.emulate_tpws != 0)) &&
>(dev->dev_attrib.unmap_zeroes_data != 0))
>   buf[5] |= 0x04;
>   [ ... ]
> }
> 

According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with.
For UNMAP, q->limits.discard_zeroes_data was never set.

That said, it's pretty much implied that supporting DISCARD means
subsequent READs return zeros, so target_configure_unmap_from_queue()
should be setting attrib->unmap_zeroes_data = 1, or just dropping it
all-together.

Post -rc1, I'll push a patch to do the latter.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 07/21] crypto: shash, caam: Make use of the new sg_map helper function

2017-05-02 Thread Logan Gunthorpe


On 28/04/17 12:30 AM, Herbert Xu wrote:
> You are right.  Indeed the existing code looks buggy as they
> don't take sg->offset into account when doing the kmap.  Could
> you send me some patches that fix these problems first so that
> they can be easily backported?

Ok, I think the only buggy one in crypto is hifn_795x. Shash and caam
both do have the sg->offset accounted for. I'll send a patch for the
buggy one shortly.

Logan

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread h...@lst.de
On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> Or, another options is use bdev_write_zeroes_sectors() to determine when
> dev_attrib->unmap_zeroes_data should be set.

Yes, that in combination with your patch to use bdev_write_zeroes_sectors
for zeroing from write same seems like the right fix.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel