Re: [dm-devel] [PATCH v2 10/18] multipathd: delay reloads during creation

2016-04-08 Thread Benjamin Marzinski
On Fri, Apr 08, 2016 at 10:36:07AM +0200, Zdenek Kabelac wrote:
> However, multipathd now *kills* this assumption: the current udev rules
> implementation for multipath devices targets only the initial scan, and all
> subsequent RESUMEs are supposed to be ignored, because the device is believed
> to have remained in the same state with only some paths added or lost.
> Scanning such a device therefore must not change any information remembered
> in the udev database. As an 'extra' bonus, multipath may SUSPEND devices
> right after the initial activation (which this patch does address to some
> extent), so the lvm2 check for skipping suspended devices may have skipped
> the whole discovery operation - and since further RESUMEs were meant to be
> ignored, the device remained invisible forever.

Yep, that's the general idea of the patch: to avoid the case where the
initial activation uevent doesn't get properly processed because the device
is suspended.
 
> Now we get to the best technical solution for multipath here, together with
> the surrounding software (i.e. udev): before multipathd starts to mark
> RESUMEs as 'ignorable', it should check/validate whether the udev database
> already contains valid information about the device (i.e. it has been
> scanned...), and only in that case mark the device to be ignored.

This makes sense, and has the benefit of not forcing multipathd to wait
before doing reloads, or having to deal with what happens if the initial
change uevent gets backed up. The only thing I'm not sure we have is a
variable to check in the udev database that will get set (and then
retained over subsequent uevents) when the device is scanned, regardless
of whether any lvm metadata is found or not. Peter, does anything like
this exist? I didn't see anything that looked like what we'd need. If
nothing currently exists, could we add a variable that gets added to the
udev database for a device whenever pvscan is run, and also gets
imported from the database on all change events?
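
For illustration only, a minimal libudev-style sketch of the kind of check
Zdenek describes on the multipathd side - ask the udev database whether the
device already carries a "scanned" marker before treating later change events
as ignorable. The property name LVM_PV_SCANNED is purely hypothetical; no
such variable exists today, which is exactly the open question above:

#include <libudev.h>
#include <stdbool.h>
#include <string.h>

/* Sketch only: "LVM_PV_SCANNED" is a hypothetical property name. */
static bool udev_db_says_scanned(struct udev *udev, const char *kname)
{
	struct udev_device *dev;
	const char *val;
	bool scanned = false;

	/* kname is the kernel name of the multipath map, e.g. "dm-3" */
	dev = udev_device_new_from_subsystem_sysname(udev, "block", kname);
	if (!dev)
		return false;

	val = udev_device_get_property_value(dev, "LVM_PV_SCANNED");
	if (val && !strcmp(val, "1"))
		scanned = true;

	udev_device_unref(dev);
	return scanned;
}

Only if such a check returned true would multipathd mark subsequent RESUMEs
as ignorable; otherwise it would leave them to be processed normally.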

-Ben

> Regards
> 
> Zdenek



Re: [dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it

2016-04-08 Thread Mike Snitzer
On Fri, Apr 08 2016 at  3:03pm -0400,
Mike Snitzer  wrote:

> On Fri, Apr 08 2016 at  2:58pm -0400,
> Mike Snitzer  wrote:
> 
> > The DM multipath service-time path-selector has historically tracked the
> > amount of outstanding IO per path and used that to approximate the
> > service-time of each path.  In practice this has shown itself to work
> > fairly well but we can do better by measuring the actual service-time
> > during IO completion and using it as the basis for path selection.
> > 
> > Measuring the actual service-time is still prone to inaccuracies given
> > that service-times vary with IO size.  But to counter any potential for
> > drawing incorrect conclusions about the service-times of a given path
> > the measured service-times are reset periodically.
> > 
> > This approach has provided a 10% increase in the selection of a path
> > that was forcibly made to be less loaded than the alternative path.
> > 
> > Reported-by: Todd Gill 
> > Signed-off-by: Mike Snitzer 
> 
> It should be noted that I have not looked at the implications on actual
> throughput or system load.  But I wanted to get this RFC out to see what
> others thought about making dm-service-time more intuitive in its
> implementation.

I have noticed fio's total and completion latency ('lat' and 'clat') go
up on this simple SAS testbed:

before:

 write: io=345920KB, bw=34379KB/s, iops=537, runt= 10062msec
slat (usec): min=10, max=47, avg=22.51, stdev= 3.71
clat (msec): min=1, max=146, avg=59.50, stdev=11.84
 lat (msec): min=1, max=146, avg=59.52, stdev=11.84

after:

  write: io=347456KB, bw=34545KB/s, iops=539, runt= 10058msec
slat (usec): min=6, max=46, avg=20.50, stdev= 3.68
clat (usec): min=385, max=146556, avg=59219.94, stdev=11580.00
 lat (usec): min=403, max=146573, avg=59240.87, stdev=11580.57

Which obviously isn't what we want (might speak to why Junichi decided
to approximate service-time)...



Re: [dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it

2016-04-08 Thread Mike Snitzer
On Fri, Apr 08 2016 at  2:58pm -0400,
Mike Snitzer  wrote:

> The DM multipath service-time path-selector has historically tracked the
> amount of outstanding IO per path and used that to approximate the
> service-time of each path.  In practice this has shown itself to work
> fairly well but we can do better by measuring the actual service-time
> during IO completion and using it as the basis for path selection.
> 
> Measuring the actual service-time is still prone to inaccuracies given
> that service-times vary with IO size.  But to counter any potential for
> drawing incorrect conclusions about the service-times of a given path
> the measured service-times are reset periodically.
> 
> This approach has provided a 10% increase in the selection of a path
> that was forcibly made to be less loaded than the alternative path.
> 
> Reported-by: Todd Gill 
> Signed-off-by: Mike Snitzer 

It should be noted that I have not looked at the implications on actual
throughput or system load.  But I wanted to get this RFC out to see what
others thought about making dm-service-time more intuitive in its
implementation.

Junichi, it'd also be great if you could provide historic context for
why you elected to approximate the service-time based on outstanding IO
rather than measure the actual service-time of each path.

Thanks,
Mike



[dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it

2016-04-08 Thread Mike Snitzer
The DM multipath service-time path-selector has historically tracked the
amount of outstanding IO per path and used that to approximate the
service-time of each path.  In practice this has shown itself to work
fairly well but we can do better by measuring the actual service-time
during IO completion and using it as the basis for path selection.

Measuring the actual service-time is still prone to inaccuracies given
that service-times vary with IO size.  But to counter any potential for
drawing incorrect conclusions about the service-times of a given path
the measured service-times are reset periodically.

This approach has provided a 10% increase in the selection of a path
that was forcibly made to be less loaded than the alternative path.

Reported-by: Todd Gill 
Signed-off-by: Mike Snitzer 
---
 drivers/md/dm-mpath.c |   6 ++-
 drivers/md/dm-path-selector.h |   5 ++-
 drivers/md/dm-queue-length.c  |   4 +-
 drivers/md/dm-service-time.c  | 101 +++---
 4 files changed, 35 insertions(+), 81 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 52baf8a..fb37dff 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -105,6 +105,7 @@ struct multipath {
 struct dm_mpath_io {
struct pgpath *pgpath;
size_t nr_bytes;
+   ktime_t io_start_time;
 };
 
 typedef int (*action_fn) (struct pgpath *pgpath);
@@ -507,7 +508,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
if (pgpath->pg->ps.type->start_io)
		pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
					      &pgpath->path,
-					      nr_bytes);
+					      nr_bytes, &mpio->io_start_time);
return DM_MAPIO_REMAPPED;
 }
 
@@ -1374,7 +1375,8 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
if (pgpath) {
		ps = &pgpath->pg->ps;
if (ps->type->end_io)
-		ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
+		ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes,
+				 &mpio->io_start_time);
}
clear_request_fn_mpio(m, map_context);
 
diff --git a/drivers/md/dm-path-selector.h b/drivers/md/dm-path-selector.h
index b6eb536..c09d2bb 100644
--- a/drivers/md/dm-path-selector.h
+++ b/drivers/md/dm-path-selector.h
@@ -13,6 +13,7 @@
#define DM_PATH_SELECTOR_H
 
#include <linux/device-mapper.h>
+#include <linux/ktime.h>
 
 #include "dm-mpath.h"
 
@@ -72,9 +73,9 @@ struct path_selector_type {
   status_type_t type, char *result, unsigned int maxlen);
 
int (*start_io) (struct path_selector *ps, struct dm_path *path,
-size_t nr_bytes);
+size_t nr_bytes, ktime_t *io_start_time);
int (*end_io) (struct path_selector *ps, struct dm_path *path,
-  size_t nr_bytes);
+  size_t nr_bytes, ktime_t *io_start_time);
 };
 
 /* Register a path selector */
diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c
index 23f1786..4affb39 100644
--- a/drivers/md/dm-queue-length.c
+++ b/drivers/md/dm-queue-length.c
@@ -217,7 +217,7 @@ out:
 }
 
 static int ql_start_io(struct path_selector *ps, struct dm_path *path,
-  size_t nr_bytes)
+  size_t nr_bytes, ktime_t *io_start_time)
 {
struct path_info *pi = path->pscontext;
 
@@ -227,7 +227,7 @@ static int ql_start_io(struct path_selector *ps, struct dm_path *path,
 }
 
 static int ql_end_io(struct path_selector *ps, struct dm_path *path,
-size_t nr_bytes)
+size_t nr_bytes, ktime_t *io_start_time)
 {
struct path_info *pi = path->pscontext;
 
diff --git a/drivers/md/dm-service-time.c b/drivers/md/dm-service-time.c
index 7b86420..1b3e45a 100644
--- a/drivers/md/dm-service-time.c
+++ b/drivers/md/dm-service-time.c
@@ -19,7 +19,7 @@
 #define ST_MAX_RELATIVE_THROUGHPUT 100
 #define ST_MAX_RELATIVE_THROUGHPUT_SHIFT   7
 #define ST_MAX_INFLIGHT_SIZE   ((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
-#define ST_VERSION "0.3.0"
+#define ST_VERSION "0.4.0"
 
 struct selector {
struct list_head valid_paths;
@@ -32,7 +32,8 @@ struct path_info {
struct dm_path *path;
unsigned repeat_count;
unsigned relative_throughput;
-   atomic_t in_flight_size;/* Total size of in-flight I/Os */
+   s64 service_time_usecs;
+   unsigned select_count;
 };
 
 static struct selector *alloc_selector(void)
@@ -92,7 +93,7 @@ static int st_status(struct path_selector *ps, struct dm_path *path,
 
switch (type) {
case STATUSTYPE_INFO:
-		DMEMIT("%d %u ", atomic_read(&pi->in_flight_size),
+   DMEMIT("%u %u ", (unsigned)pi->service_time_usecs,
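
For illustration, here is a minimal sketch of the mechanism the description
implies - record a ktime when an I/O is dispatched, account the measured
delta at completion, and periodically reset the accumulated values. Only the
names visible in the hunks above are taken from the patch; the reset
threshold, the *_sketch helpers and the overall selection logic are guesses,
not the actual code, and locking is omitted:

/* Hypothetical reset threshold; the real policy may differ. */
#define ST_SKETCH_RESET_COUNT	1024

static int st_start_io_sketch(struct path_selector *ps, struct dm_path *path,
			      size_t nr_bytes, ktime_t *io_start_time)
{
	*io_start_time = ktime_get();	/* stored in dm_mpath_io by dm-mpath */
	return 0;
}

static int st_end_io_sketch(struct path_selector *ps, struct dm_path *path,
			    size_t nr_bytes, ktime_t *io_start_time)
{
	struct path_info *pi = path->pscontext;

	/* accumulate the measured completion latency in microseconds */
	pi->service_time_usecs += ktime_us_delta(ktime_get(), *io_start_time);
	return 0;
}

/*
 * Pick the path with the smallest accumulated service time; periodically
 * zero the accumulators so that old measurements (and I/O-size effects)
 * wash out over time.
 */
static struct dm_path *st_select_path_sketch(struct path_selector *ps,
					     size_t nr_bytes)
{
	struct selector *s = ps->context;
	struct path_info *pi, *best = NULL;

	list_for_each_entry(pi, &s->valid_paths, list)
		if (!best || pi->service_time_usecs < best->service_time_usecs)
			best = pi;

	if (best && ++best->select_count >= ST_SKETCH_RESET_COUNT) {
		list_for_each_entry(pi, &s->valid_paths, list) {
			pi->service_time_usecs = 0;
			pi->select_count = 0;
		}
	}

	return best ? best->path : NULL;
}

Whether to reset by selection count (as sketched) or on a timer is an open
policy question; the count is used here purely for simplicity.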

Re: [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance

2016-04-08 Thread Johannes Thumshirn
Ladies and Gentlemen,
To show off some numbers from our testing:

All tests were performed against the array's cache, not the disks, since we
wanted to test the Linux stack rather than the disk array.

All single queue tests have been performed with the deadline I/O Scheduler.

Comments welcome, have fun reading :-)

QLogic 32GBit FC HBA NetApp EF560 Array w/ DM MQ this patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 785136 | 3066.1MB/s |
8k  | 734983 | 5742.6MB/s |
16k | 398516 | 6226.9MB/s |
32k | 200589 | 6268.5MB/s |
64k | 100417 | 6276.2MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 788620 | 3080.6MB/s |
8k  | 736359 | 5752.9MB/s |
16k | 398597 | 6228.8MB/s |
32k | 200487 | 6265.3MB/s |
64k | 100402 | 6275.2MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242105 | 968423KB/s |
8k  | 90     | 1736.7MB/s |
16k | 178191 | 2784.3MB/s |
32k | 133619 | 4175.7MB/s |
64k | 97693  | 6105.9MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134788 | 539155KB/s |
8k  | 132361 | 1034.8MB/s |
16k | 129941 | 2030.4MB/s |
32k | 128288 | 4009.4MB/s |
64k | 97776  | 6111.0MB/s |

QLogic 32GBit FC HBA NetApp EF560 Array w/ DM SQ this patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 112402 | 449608KB/s |
8k  | 112818 | 902551KB/s |
16k | 111885 | 1748.3MB/s |
32k | 188015 | 5875.6MB/s |
64k | 99021  | 6188.9MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115046 | 460186KB/s |
8k  | 113974 | 911799KB/s |
16k | 113374 | 1771.5MB/s |
32k | 192932 | 6029.2MB/s |
64k | 100474 | 6279.7MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 114284 | 457138KB/s |
8k  | 113992 | 911944KB/s |
16k | 113715 | 1776.9MB/s |
32k | 130402 | 4075.9MB/s |
64k | 92243  | 5765.3MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115540 | 462162KB/s |
8k  | 114243 | 913951KB/s |
16k | 300153 | 4689.1MB/s |
32k | 141069 | 4408.5MB/s |
64k | 97620  | 6101.3MB/s |


QLogic 32GBit FC HBA NetApp EF560 Array w/ DM MQ previous patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 782733 | 3057.6MB/s |
8k  | 732143 | 5719.9MB/s |
16k | 398314 | 6223.7MB/s |
32k | 200538 | 6266.9MB/s |
64k | 100422 | 6276.5MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 786707 | 3073.8MB/s |
8k  | 730579 | 5707.7MB/s |
16k | 398799 | 6231.3MB/s |
32k | 200518 | 6266.2MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242426 | 969707KB/s |
8k  | 223079 | 1742.9MB/s |
16k | 177889 | 2779.6MB/s |
32k | 133637 | 4176.2MB/s |
64k | 97727  | 6107.1MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134360 | 537442KB/s |
8k  | 129738 | 1013.6MB/s |
16k | 129746 | 2027.3MB/s |
32k | 127875 | 3996.1MB/s |
64k | 97683  | 6105.3MB/s |

Emulex 16GBit FC HBA NetApp EF560 Array w/ DM MQ this patchset:

[Beware, this is with Hannes' lockless lpfc patches, which are not upstream as 
they're quite experimental, but are good at showing the capability of the new 
dm-mpath]

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 939752 | 3670.1MB/s |
8k  | 741462 | 5792.7MB/s |
16k | 399285 | 6238.9MB/s |
32k | 196490 | 6140.4MB/s |
64k | 100325 | 6270.4MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 926222 | 3618.6MB/s |
8k  | 750125 | 5860.4MB/s |
16k | 397770 | 6215.2MB/s |
32k | 200130 | 6254.8MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 251938 | 984.14MB/s |
8k  | 226712 | 1771.2MB/s |
16k | 180739 | 2824.5MB/s |
32k | 133316 | 4166.2MB/s |
64k | 98738  | 6171.2MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134660 | 538643KB/s |
8k  | 131585 | 1028.9MB/s |
16k | 131030 | 2047.4MB/s |
32k | 126987 | 3968.4MB/s |
64k | 98882  | 6180.2MB/s |

Emulex 16GBit FC HBA NetApp EF560 Array w/ DM SQ this patchset:

Random Read:
BS  | IOPS   | BW |
----+--------+----+
4k  | 101860 |

Re: [dm-devel] [PATCH v2 10/18] multipathd: delay reloads during creation

2016-04-08 Thread Zdenek Kabelac

On 8.4.2016 at 01:20, Benjamin Marzinski wrote:

lvm needs PV devices to not be suspended while the udev rules are
running, for them to be correctly identified as PVs. However, multipathd
will often be in a situation where it will create a multipath device
upon seeing a path, and then immediately reload the device upon seeing
another path.  If multipath is reloading a device while processing the
udev event from its creation, lvm can fail to identify it as a PV. This
can cause systems to fail to boot. Unfortunately, using udev
synchronization cookies to solve this issue would cause a host of other
issues that could only be avoided by a pretty substantial change in how
multipathd does locking and event processing.



This somewhat misunderstands the core problem - it's not about 'lvm2 needs
devices to not be suspended'. So let me elaborate with more details.

(And Peter, please correct me in case of mistakes ;)


The lvm2 command has lvm.conf settings allowing a user to select whether
suspended devices should be scanned or not. There is already a 'race' here -
checking that a device is not suspended and then opening it leaves plenty of
room for the device to become suspended in between - but that is not the
major issue in this case.
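
To make that race concrete, here is a rough userspace sketch of the
check-then-open pattern; the libdevmapper calls are real, but the helper
itself is hypothetical and only illustrates the window between the check and
the open:

#include <fcntl.h>
#include <libdevmapper.h>

/*
 * Hypothetical helper: nothing prevents the device from being suspended
 * between the dm_task_get_info() check and the open() below.
 */
static int open_if_not_suspended(const char *dm_name, const char *dev_path)
{
	struct dm_task *dmt;
	struct dm_info info;
	int fd = -1;

	dmt = dm_task_create(DM_DEVICE_INFO);
	if (!dmt)
		return -1;

	if (dm_task_set_name(dmt, dm_name) && dm_task_run(dmt) &&
	    dm_task_get_info(dmt, &info) && info.exists && !info.suspended) {
		/* <-- race window: the device may be suspended right here */
		fd = open(dev_path, O_RDONLY);
	}

	dm_task_destroy(dmt);
	return fd;	/* later reads will block if the device got suspended */
}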



The reasons to skip reading suspended devices are primarily to avoid holding
the VG lock for a potentially very long time, and to avoid udev - with its
'built-in' 30-second timeout - killing a worker process that is blocked in a
blkid scan, with the danger of the device being marked as GONE and further
consequences such as an automated umount by systemd.



Thus the lvm2/dm udev rules implemented a (racy) check that skips reading a
suspended device. This check may also skip the call to pvscan, under the
general assumption that a RESUME will follow with a CHANGE event and the
device will be properly scanned anyway - so no information would be lost, it
would only get into the udev database later.



However, multipathd now *kills* this assumption: the current udev rules
implementation for multipath devices targets only the initial scan, and all
subsequent RESUMEs are supposed to be ignored, because the device is believed
to have remained in the same state with only some paths added or lost.
Scanning such a device therefore must not change any information remembered
in the udev database. As an 'extra' bonus, multipath may SUSPEND devices
right after the initial activation (which this patch does address to some
extent), so the lvm2 check for skipping suspended devices may have skipped
the whole discovery operation - and since further RESUMEs were meant to be
ignored, the device remained invisible forever.


Now we get to the best technical solution for multipath here, together with
the surrounding software (i.e. udev): before multipathd starts to mark
RESUMEs as 'ignorable', it should check/validate whether the udev database
already contains valid information about the device (i.e. it has been
scanned...), and only in that case mark the device to be ignored.


This may of course mean a few more repeated initial scans - but it's the
only 'safe' way to proceed (i.e. you can't resolve the problem with cookie
waiting on a loaded system, since the udev 30-second timeout makes the
outcome unpredictable).


Regards

Zdenek
