Re: [dm-devel] [PATCH] dm-region-hash: fix strange usage of mempool_alloc.

2017-05-03 Thread NeilBrown
On Wed, May 03 2017, Mikulas Patocka wrote:

> On Mon, 24 Apr 2017, NeilBrown wrote:
>> 
>> I had a look at how the allocated 'dm_region' objects are used,
>> and it would take a bit of work to make it really safe.
>> My guess is __rh_find() should be allowed to fail, and the various
>> callers need to handle failure.
>> For example, dm_rh_inc_pending() would be given a second bio_list,
>> and would move any bios for which rh_inc() fails, onto that list.
>> Then do_writes() would merge that list back into ms->writes.
>> That way do_mirror() would not block indefinitely and forward progress
>> could be assured ... maybe.
>> It would take more work than I'm able to give at the moment, so
>> I'm happy to just drop this patch.
>> 
>> Thanks,
>> NeilBrown
>
> I think that the only way to fix this would be to preallocate all
> the regions when the target is created.
>
> But, with the default region size of 512kiB, it would cause high memory
> consumption (approximately 1GB of RAM for a 20TB device).

Two reflections:
 1/ This is close to what md/bitmap does.
    It actually uses a 2-level array for the 'pending' field from
    dm_region, combined with something similar to 'state'.
    The top level is allocated when the device is created.
    Entries in this table are either
     - pointers to a second-level array for 2048 regions, or
     - entries for 2 giant regions, 1024 times the normal size.

    So if we cannot allocate a page when we need that second level,
    we just use an enormous region and so risk resync taking a bit
    longer if there is a crash.  (A rough sketch of this layout
    follows point 2 below.)

 2/ Even though md does pre-alloc to a degree, I'm not convinced that it
    is necessary.
    We only need a region to be recorded when it is actively being
    written to, or when it is being recovered.
    We could, in theory, have just one region that is written to and one
    region that is being recovered.  If a write request arrives for a
    different region, it blocks until the current region has no active
    requests.  Then that region is forgotten, the new region is
    activated, and the new write is allowed to proceed.
    Obviously this would be horribly slow, but it should be
    deadlock-free.
    Using a mempool instead of a single region would then allow multiple
    regions to be active in parallel, which would improve throughput
    without affecting the deadlock-freedom.
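
To make 1/ concrete, here is a minimal sketch of that layout - the names,
sizes, and userspace allocation are illustrative, not the actual md/bitmap
or dm-region-hash code:

#include <stdlib.h>

#define REGIONS_PER_CHUNK 2048   /* regions tracked by one second-level array */

struct region_chunk {
	unsigned short pending[REGIONS_PER_CHUNK];  /* per-region write counts */
};

struct region_table {
	struct region_chunk **chunks;   /* top level, sized at create time */
	unsigned short *giant_pending;  /* fallback: one count per whole chunk */
	unsigned long nchunks;
};

/* Count a pending write against region 'r', degrading to coarse
 * tracking if the second-level allocation fails. */
static void region_inc_pending(struct region_table *rt, unsigned long r)
{
	unsigned long c = r / REGIONS_PER_CHUNK;

	if (!rt->chunks[c]) {
		rt->chunks[c] = calloc(1, sizeof(*rt->chunks[c]));
		if (!rt->chunks[c]) {
			/* No memory: treat the whole chunk as one giant
			 * region; a crash just means a longer resync. */
			rt->giant_pending[c]++;
			return;
		}
	}
	rt->chunks[c]->pending[r % REGIONS_PER_CHUNK]++;
}

The point is just that a failed allocation degrades to coarser dirty-region
tracking instead of blocking or deadlocking the write path.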

Maybe I'll try to code it and see what happens ... maybe not.

NeilBrown

 


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [git pull] device mapper changes for 4.12

2017-05-03 Thread Mike Snitzer
On Wed, May 03 2017 at  1:51pm -0400,
Linus Torvalds  wrote:

> On Tue, May 2, 2017 at 11:58 AM, Mike Snitzer  wrote:
> >
> > - Add a new DM integrity target that emulates a block device that has
> >   additional per-sector tags that can be used for storing integrity
> >   information.
> 
> So this one comes with a nice documentation file and everything, but
> let me just quote all the help that you give users when faced with the
> CONFIG_DM_INTEGRITY choice:
> 
> CONFIG_DM_INTEGRITY:
> 
> This is the integrity target.
> 
> That's it. Not really very helpful.
> 
> The kernel configuration phase is already likely the most frustrating
> part of building your own kernel; it's really worth trying to make it
> a *bit* less frustrating than this.
> 
> Even just a stupid little blurb that says
> 
>"Emulates a block device that has additional per-sector tags that can
> be used for storing integrity information.
> 
> If you're not sure, you can probably just say 'n'"
> 
> or something like that. People who use this likely know they use it,
> and it should lower the stress level for others.

Sure thing.  Sorry for it being so terse.  I'll get it fixed and send it
your way end of this week or early next (waiting on a dm-cache fix from
Joe Thornber that I'll send at the same time).

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] dm-region-hash: fix strange usage of mempool_alloc.

2017-05-03 Thread Mikulas Patocka


On Mon, 24 Apr 2017, NeilBrown wrote:

> On Fri, Apr 21 2017, Mikulas Patocka wrote:
> 
> > On Mon, 10 Apr 2017, NeilBrown wrote:
> >
> >> mempool_alloc() should only be called with GFP_ATOMIC when
> >> it is not safe to wait. Passing __GFP_NOFAIL to kmalloc()
> >> says that it is safe to wait indefinitely.  So this code is
> >> inconsistent.
> >> 
> >> Clearly it is OK to wait indefinitely in this code, and
> >> mempool_alloc() is able to do that.  So just use
> >> mempool_alloc, and allow it to sleep.  If no memory is
> >> available it will wait for something to be returned to the
> >> pool, and will retry a normal allocation regularly.
> >
> > The region hash code is buggy anyway, because it may allocate more entries 
> > than the size of the pool and not give them back.
> >
> > That kmalloc was introduced in commit c06aad854 to work around a
> > deadlock due to incorrect mempool usage.
> >
> > Your patch slightly increases the probability of the deadlock because 
> > mempool_alloc does all allocations with __GFP_NOMEMALLOC, so it uses a
> > higher limit than kmalloc(GFP_NOIO).
> >
> 
> Thanks for the review.
> 
> I had a look at how the allocated 'dm_region' objects are used,
> and it would take a bit of work to make it really safe.
> My guess is __rh_find() should be allowed to fail, and the various
> callers need to handle failure.
> For example, dm_rh_inc_pending() would be given a second bio_list,
> and would move any bios for which rh_inc() fails, onto that list.
> Then do_writes() would merge that list back into ms->writes.
> That way do_mirror() would not block indefinitely and forward progress
> could be assured ... maybe.
> It would take more work than I'm able to give at the moment, so
> I'm happy to just drop this patch.
> 
> Thanks,
> NeilBrown

I think that the only way to fix this would be to preallocate all
the regions when the target is created.

But, with the default region size of 512kiB, it would cause high memory
consumption (approximately 1GB of RAM for a 20TB device).

Mikulas
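
For reference, the allocation pattern the quoted patch description is
talking about looks roughly like this (a reconstruction for illustration;
not a verbatim copy of dm-region-hash.c):

	/* current code: an atomic pool allocation with a sleeping,
	 * never-failing fallback -- the inconsistency the patch points out */
	nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
	if (unlikely(!nreg))
		nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);

	/* proposed change: sleeping is clearly allowed here, so let
	 * mempool_alloc wait -- it retries a normal allocation regularly
	 * and also waits for objects to be returned to the pool */
	nreg = mempool_alloc(rh->region_pool, GFP_NOIO);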

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [git pull] device mapper changes for 4.12

2017-05-03 Thread Linus Torvalds
On Tue, May 2, 2017 at 11:58 AM, Mike Snitzer  wrote:
>
> - Add a new DM integrity target that emulates a block device that has
>   additional per-sector tags that can be used for storing integrity
>   information.

So this one comes with a nice documentation file and everything, but
let me just quote all the help that you give users when faced with the
CONFIG_DM_INTEGRITY choice:

CONFIG_DM_INTEGRITY:

This is the integrity target.

That's it. Not really very helpful.

The kernel configuration phase is already likely the most frustrating
part of building your own kernel; it's really worth trying to make it
a *bit* less frustrating than this.

Even just a stupid little blurb that says

   "Emulates a block device that has additional per-sector tags that can
be used for storing integrity information.

If you're not sure, you can probably just say 'n'"

or something like that. People who use this likely know they use it,
and it should lower the stress level for others.

   Linus

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-03 Thread Mike Snitzer
On Tue, May 02 2017 at 11:33pm -0400,
Nicholas A. Bellinger  wrote:

> On Tue, 2017-05-02 at 09:23 +0200, h...@lst.de wrote:
> > On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > > dev_attrib->unmap_zeroes_data should be set.
> > 
> > Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> > for zeroing from write same seems like the right fix.
> 
> The larger target/iblock conversion patch looks like post v4.12 material
> at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
> plan to push the following patch post -rc1.
> 
> diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> index d2f089c..e7caf78 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
>  	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
>  	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
>  						block_size;
> -	attrib->unmap_zeroes_data = 0;
> +	attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
>  	return true;
>  }
>  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 

Completely a nit but: why the extra parenthesis?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-03 Thread Peter Rajnoha
On 05/02/2017 03:42 PM, Jes Sorensen wrote:
> On 04/28/2017 01:05 AM, NeilBrown wrote:
>>
>> When an array is created the content is not initialized,
>> so it could have remnants of an old filesystem or md array
>> etc on it.
>> udev will see this and might try to activate it, which is almost
>> certainly not what is wanted.
>>
>> So create a mechanism for mdadm to communicate with udev to tell
>> it that the device isn't ready.  This mechanism is the existence
>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>
>> When creating an array, mdadm will create the file.
>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>> presence of that file and set ENV{SYSTEMD_READY}="0".
>> This is fairly uniformly used to suppress actions based on the
>> contents of the device.
>>
>> Signed-off-by: NeilBrown 
>> ---
>>  Assemble.c                  |  2 +-
>>  Build.c                     |  2 +-
>>  Create.c                    |  9 +++-
>>  Incremental.c               |  4 ++--
>>  Makefile                    |  4 ++--
>>  lib.c                       | 29 +
>>  mdadm.h                     |  4 +++-
>>  mdopen.c                    | 52 -
>>  udev-md-raid-creating.rules |  7 ++
>>  9 files changed, 86 insertions(+), 27 deletions(-)
>>  create mode 100644 udev-md-raid-creating.rules
> 
> Applied!
> 

How is this solving the problem then? The patch is flagging a device as
not ready, then clearing the flag after some time. Where does the wiping
actually happen?

-- 
Peter

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

2017-05-03 Thread Peter Rajnoha
On 05/02/2017 03:40 PM, Jes Sorensen wrote:
> On 05/02/2017 07:40 AM, Peter Rajnoha wrote:
>> On 05/01/2017 06:35 AM, NeilBrown wrote:
>>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>>>> Then mdadm opens the device, clears any old content/signatures the data
>>>> area may contain, then closes it - this generates the third event -
>>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>>> rule). And this one would drop the "not initialized" flag in the udev db
>>>> and the scans in udev are now enabled.
>>>
>>> Nope, it would be incorrect for mdadm to clear any old content.
>>> Sometimes people want to ignore old content.  Sometimes they very
>>> definitely want to use it.  It would be wrong for any code to try to
>>> guess what is wanted.
>>
>> mdadm is not going to guess - it can have an option to
>> enable/disable the wiping on demand directly on the command line (which
>> is also what is actually done in LVM).
> 
> I know the anaconda team keeps pushing for the nonsense of being able to
> wipe drives on creation. It is wrong, it is broken, and it is not going
> to happen.
> 

I'm not thinking about anaconda at the moment at all. It's just one of
the many users of mdadm. I'm thinking about a fix in general for all the
users who expect the device to be initialized properly when mdadm
--create returns.

>> Otherwise, if mdadm is not going to wipe/initialize the device itself,
>> then it needs to wait for the external tool to do it (based on external
>> configuration or on some manual wipefs-like call). So the sequence
>> would be:
>>
>>  1) mdadm creating the device
>>  2) mdadm setting up the device, marking it as "not initialized yet"
>>  4a) mdadm waiting for the external decision to be made about wiping part
>>  4b) external tool doing the wiping (or not) based on configuration or
>> user's will
>>  5) mdadm continuing and finishing when the wiping part is complete
>>
>> I expect mdadm to return only if the device is properly initialized;
>> otherwise it's harder for other tools called after mdadm to start
>> working with the device - they need to poll for the state laboriously
>> and check for readiness.
> 
> What defines readiness? Some believe a raid1 array must be fully
> assembled with all members; other setups are happy to have one running
> drive in place.
> 

By "ready" I mean the point when it's safe to do a scan without seeing
old data (garbage) that may confuse udev hooks and udev event listeners.
That scan happens when the activating "change" uevent comes and this
rule does not pass:
ATTR{md/array_state}=="|clear|inactive" (if it passes, the device is not
scanned yet).

> 4a/4b in your list here once again have no place in mdadm. Please kindly
> tell the anaconda team to go back and handle this properly instead.

mdadm is creating the device, so it should be primarily responsible for
providing a device which is cleared and ready for use, without causing
any confusion on an event-based system where various scans are executed
based on incoming udev events.

Alternatively, if mdadm is not going to be the place where the wiping
happens, I'd expect at least the sequence above (which is more complex,
yes - that's why I think having the wiping directly in mdadm is a much
easier solution).

If you don't wipe the data and you don't give others time to hook in
and do that, you make it harder for them (they need to deactivate all
the stack/garbage that is found in the data area after previous use).

Also, we can't reliably call wiping on the underlying components first,
because once they become MD components, the data area for the MD device
has an offset and a new data range is revealed from the underlying
devices, which may expose old signatures that were not visible on those
underlying devices before the MD device got created.

-- 
Peter

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-03 Thread Peter Rajnoha
On 05/02/2017 03:32 PM, Jes Sorensen wrote:
> On 04/28/2017 05:28 AM, Peter Rajnoha wrote:
>> On 04/28/2017 07:05 AM, NeilBrown wrote:
>>>
>>> When an array is created the content is not initialized,
>>> so it could have remnants of an old filesystem or md array
>>> etc on it.
>>> udev will see this and might try to activate it, which is almost
>>> certainly not what is wanted.
>>>
>>> So create a mechanism for mdadm to communicate with udev to tell
>>> it that the device isn't ready.  This mechanism is the existence
>>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>>
>>> When creating an array, mdadm will create the file.
>>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>>> presence of that file and set ENV{SYSTEMD_READY}="0".
>>> This is fairly uniformly used to suppress actions based on the
>>> contents of the device.
>>
>> The scans in udev are primarily directed by the blkid call which detects
>> the signatures, and based on this information various other udev rules fire.
>>
>> The blkid as well as wipefs uses the common libblkid library to detect these
>> signatures - is mdadm going to use libblkid to wipe the signatures on MD
>> device initialization or is it relying on external tools to do this? How
>> is mdadm actually initializing/wiping the newly created MD device?
> 
> mdadm doesn't wipe data and it isn't supposed to. Being able to create
> an array from drives with existing data is a feature.
> 
> It is the responsibility of the system administrator to handle drives
> with existing data, in the same way the administrator is expected to
> handle insertion of USB drives or external drives being powered on.

There's a difference though - when you're *creating* a completely new
device that is an abstraction over existing devices, you (most of the
time) expect that new device to be initialized. For those corner cases
where people do need to keep the old data, there can be an option to do
that. When you're inserting existing drives, you're not creating them -
when those devices come from the factory (they're "created"), they never
contain garbage and old data when you buy them.

-- 
Peter

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-03 Thread Jes Sorensen

On 04/28/2017 01:05 AM, NeilBrown wrote:


> When an array is created the content is not initialized,
> so it could have remnants of an old filesystem or md array
> etc on it.
> udev will see this and might try to activate it, which is almost
> certainly not what is wanted.
>
> So create a mechanism for mdadm to communicate with udev to tell
> it that the device isn't ready.  This mechanism is the existence
> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>
> When creating an array, mdadm will create the file.
> A new udev rule file, 01-md-raid-creating.rules, will detect the
> presence of that file and set ENV{SYSTEMD_READY}="0".
> This is fairly uniformly used to suppress actions based on the
> contents of the device.
>
> Signed-off-by: NeilBrown 
> ---
>  Assemble.c                  |  2 +-
>  Build.c                     |  2 +-
>  Create.c                    |  9 +++-
>  Incremental.c               |  4 ++--
>  Makefile                    |  4 ++--
>  lib.c                       | 29 +
>  mdadm.h                     |  4 +++-
>  mdopen.c                    | 52 -
>  udev-md-raid-creating.rules |  7 ++
>  9 files changed, 86 insertions(+), 27 deletions(-)
>  create mode 100644 udev-md-raid-creating.rules


Applied!

I like this solution much better, even though I am not in love with the 
giant usleep() call. Would be nice to find a better way around that.
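
The handshake itself is small enough to sketch - hypothetical helper
names, not the actual Create.c/mdopen.c/lib.c code:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Create /run/mdadm/created-mdXXX before the array appears; the udev
 * rule 01-md-raid-creating.rules sets ENV{SYSTEMD_READY}="0" for as
 * long as the file exists, so nothing acts on stale content. */
static int mark_device_creating(const char *devnm, char *path, size_t len)
{
	snprintf(path, len, "/run/mdadm/created-%s", devnm);
	return open(path, O_CREAT | O_WRONLY | O_EXCL, 0600);
}

/* Remove the flag once the array is fully set up; the next uevent lets
 * udev re-evaluate the device normally. */
static void mark_device_ready(const char *path, int fd)
{
	if (fd >= 0)
		close(fd);
	unlink(path);
}

All of the policy lives in the udev rule; mdadm only manages the lifetime
of the flag file.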


Sorry it took so long to get back to you on this, last week was a mess.

Thanks,
Jes


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-03 Thread Nicholas A. Bellinger
On Tue, 2017-05-02 at 09:23 +0200, h...@lst.de wrote:
> On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > dev_attrib->unmap_zeroes_data should be set.
> 
> Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> for zeroing from write same seems like the right fix.

The larger target/iblock conversion patch looks like post v4.12 material
at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
plan to push the following patch post -rc1.

diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
index d2f089c..e7caf78 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
 	attrib->unmap_granularity = q->limits.discard_granularity / block_size;
 	attrib->unmap_granularity_alignment = q->limits.discard_alignment /
 						block_size;
-	attrib->unmap_zeroes_data = 0;
+	attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
 	return true;
 }
 EXPORT_SYMBOL(target_configure_unmap_from_queue);

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

2017-05-03 Thread Jes Sorensen

On 05/02/2017 07:40 AM, Peter Rajnoha wrote:

> On 05/01/2017 06:35 AM, NeilBrown wrote:
>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>>> Then mdadm opens the device, clears any old content/signatures the data
>>> area may contain, then closes it - this generates the third event -
>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>> rule). And this one would drop the "not initialized" flag in the udev db
>>> and the scans in udev are now enabled.
>>
>> Nope, it would be incorrect for mdadm to clear any old content.
>> Sometimes people want to ignore old content.  Sometimes they very
>> definitely want to use it.  It would be wrong for any code to try to
>> guess what is wanted.
>
> mdadm is not going to guess - it can have an option to
> enable/disable the wiping on demand directly on the command line (which
> is also what is actually done in LVM).

I know the anaconda team keeps pushing for the nonsense of being able to
wipe drives on creation. It is wrong, it is broken, and it is not going
to happen.

> Otherwise, if mdadm is not going to wipe/initialize the device itself,
> then it needs to wait for the external tool to do it (based on external
> configuration or on some manual wipefs-like call). So the sequence would be:
>
>  1) mdadm creating the device
>  2) mdadm setting up the device, marking it as "not initialized yet"
>  4a) mdadm waiting for the external decision to be made about wiping part
>  4b) external tool doing the wiping (or not) based on configuration or
> user's will
>  5) mdadm continuing and finishing when the wiping part is complete
>
> I expect mdadm to return only if the device is properly initialized;
> otherwise it's harder for other tools called after mdadm to start
> working with the device - they need to poll for the state laboriously
> and check for readiness.

What defines readiness? Some believe a raid1 array must be fully
assembled with all members; other setups are happy to have one running
drive in place.

4a/4b in your list here once again have no place in mdadm. Please kindly
tell the anaconda team to go back and handle this properly instead.


Jes

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.

2017-05-03 Thread Jes Sorensen

On 04/28/2017 05:28 AM, Peter Rajnoha wrote:

> On 04/28/2017 07:05 AM, NeilBrown wrote:
>>
>> When an array is created the content is not initialized,
>> so it could have remnants of an old filesystem or md array
>> etc on it.
>> udev will see this and might try to activate it, which is almost
>> certainly not what is wanted.
>>
>> So create a mechanism for mdadm to communicate with udev to tell
>> it that the device isn't ready.  This mechanism is the existence
>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>
>> When creating an array, mdadm will create the file.
>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>> presence of that file and set ENV{SYSTEMD_READY}="0".
>> This is fairly uniformly used to suppress actions based on the
>> contents of the device.
>
> The scans in udev are primarily directed by the blkid call which detects
> the signatures, and based on this information various other udev rules fire.
>
> The blkid as well as wipefs uses the common libblkid library to detect these
> signatures - is mdadm going to use libblkid to wipe the signatures on MD
> device initialization or is it relying on external tools to do this? How
> is mdadm actually initializing/wiping the newly created MD device?

mdadm doesn't wipe data and it isn't supposed to. Being able to create
an array from drives with existing data is a feature.

It is the responsibility of the system administrator to handle drives
with existing data, in the same way the administrator is expected to
handle insertion of USB drives or external drives being powered on.


Jes

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel