Re: [dm-devel] [PATCH v2 10/18] multipathd: delay reloads during creation
On Fri, Apr 08, 2016 at 10:36:07AM +0200, Zdenek Kabelac wrote:
> However now multipathd *kills* this assumption - since the current udev
> rules implementation for multipath devices targets only the initial scan,
> and all subsequent RESUMEs are supposed to be ignored, as it's believed the
> device remained in the same state and only some paths have been added/lost.
> Scanning such a device thus shall not change any remembered info in the
> udev database. As an 'extra' bonus, multipath may SUSPEND devices (and
> that's somehow solved by this patch) right after the initial activation of
> the device, so the lvm2 check for skipping suspended devices may have
> skipped the whole discovery operation, and since further RESUMEs were meant
> to be ignored, the device remained invisible forever.

Yep, that's the general idea of the patch, to avoid the case where the
initial activation uevent isn't getting processed while the device is
suspended.

> Now we get to the best technical solution for multipath here with the
> other surrounding software (i.e. udev) - before multipathd starts to mark
> RESUMEs as 'ignorable', it should check/validate whether the udev db
> already contains valid information about the device (i.e. it's been
> scanned...) and only in that case mark the device to be ignored.

This makes sense, and has the benefit of not forcing multipathd to wait
before doing reloads, or having to deal with what happens if the initial
change uevent gets backed up.

The only thing I'm not sure we have is a variable to check in the udev
database that will get set (and then retained over subsequent uevents) when
the device is scanned, regardless of whether any lvm metadata is found or
not. Peter, does anything like this exist? I didn't see anything that
looked like what we'd need. If nothing currently exists, could we add a
variable that gets added to the udev database for a device whenever pvscan
is run, and also gets imported from the database on all change events?
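For illustration, the "variable that gets set by pvscan and imported on all
change events" could look roughly like the rule fragment below. This is only
a sketch of the idea, not existing lvm2 rules: the variable name LVM_SCANNED
is hypothetical, and the exact pvscan invocation is an assumption. IMPORT{db}
and ENV assignments are standard udev rule constructs.

```
# Hypothetical sketch: persist a "this device was scanned" flag in the
# udev db so it survives subsequent change/RESUME uevents. multipathd
# could then check the flag before marking RESUMEs as ignorable.
ACTION=="change", IMPORT{db}="LVM_SCANNED"
ACTION=="change", ENV{LVM_SCANNED}!="1", ENV{DM_SUSPENDED}!="1", \
    RUN+="/usr/sbin/lvm pvscan --cache $env{DEVNAME}", \
    ENV{LVM_SCANNED}="1"
```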
-Ben

> Regards
>
> Zdenek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it
On Fri, Apr 08 2016 at 3:03pm -0400, Mike Snitzer wrote:

> On Fri, Apr 08 2016 at 2:58pm -0400,
> Mike Snitzer wrote:
>
> > The DM multipath service-time path-selector has historically tracked the
> > amount of outstanding IO per path and used that to approximate the
> > service-time of each path.  In practice this has shown itself to work
> > fairly well but we can do better by measuring the actual service-time
> > during IO completion and using it as the basis for path selection.
> >
> > Measuring the actual service-time is still prone to inaccuracies given
> > that service-times vary with IO size.  But to counter any potential for
> > drawing incorrect conclusions about the service-times of a given path
> > the measured service-times are reset periodically.
> >
> > This approach has provided a 10% increase in the selection of a path
> > that was forcibly made to be less loaded than the alternative path.
> >
> > Reported-by: Todd Gill
> > Signed-off-by: Mike Snitzer
>
> It should be noted that I have not looked at the implications on actual
> throughput or system load.  But I wanted to get this RFC out to see what
> others thought about making dm-service-time more intuitive in its
> implementation.

I have noticed fio's total and completion latency ('lat' and 'clat') go up
on this simple SAS testbed:

before:
  write: io=345920KB, bw=34379KB/s, iops=537, runt= 10062msec
    slat (usec): min=10, max=47, avg=22.51, stdev= 3.71
    clat (msec): min=1, max=146, avg=59.50, stdev=11.84
     lat (msec): min=1, max=146, avg=59.52, stdev=11.84

after:
  write: io=347456KB, bw=34545KB/s, iops=539, runt= 10058msec
    slat (usec): min=6, max=46, avg=20.50, stdev= 3.68
    clat (usec): min=385, max=146556, avg=59219.94, stdev=11580.00
     lat (usec): min=403, max=146573, avg=59240.87, stdev=11580.57

Which obviously isn't what we want (might speak to why Junichi decided to
approximate service-time)...
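The mail doesn't include the fio job file, but the quoted numbers imply a
sequential-write job at roughly bs=64k (bw/iops ~ 64KB) and queue depth ~32
(clat * iops ~ 32). A job along those lines - device path and all option
values here are inferred assumptions, not taken from the mail - would be:

```
[global]
ioengine=libaio
direct=1
rw=write
bs=64k
iodepth=32
runtime=10
time_based=1

[svctime-test]
; placeholder device name - point this at the dm-mpath device under test
filename=/dev/mapper/mpatha
```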
Re: [dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it
On Fri, Apr 08 2016 at 2:58pm -0400, Mike Snitzer wrote:

> The DM multipath service-time path-selector has historically tracked the
> amount of outstanding IO per path and used that to approximate the
> service-time of each path.  In practice this has shown itself to work
> fairly well but we can do better by measuring the actual service-time
> during IO completion and using it as the basis for path selection.
>
> Measuring the actual service-time is still prone to inaccuracies given
> that service-times vary with IO size.  But to counter any potential for
> drawing incorrect conclusions about the service-times of a given path
> the measured service-times are reset periodically.
>
> This approach has provided a 10% increase in the selection of a path
> that was forcibly made to be less loaded than the alternative path.
>
> Reported-by: Todd Gill
> Signed-off-by: Mike Snitzer

It should be noted that I have not looked at the implications on actual
throughput or system load.  But I wanted to get this RFC out to see what
others thought about making dm-service-time more intuitive in its
implementation.

Junichi, it'd also be great if you could provide historic context for why
you elected to approximate the service-time based on outstanding IO rather
than measure the actual service-time of each path.

Thanks,
Mike
[dm-devel] [RFC PATCH] dm service time: measure service time rather than approximate it
The DM multipath service-time path-selector has historically tracked the
amount of outstanding IO per path and used that to approximate the
service-time of each path.  In practice this has shown itself to work
fairly well but we can do better by measuring the actual service-time
during IO completion and using it as the basis for path selection.

Measuring the actual service-time is still prone to inaccuracies given
that service-times vary with IO size.  But to counter any potential for
drawing incorrect conclusions about the service-times of a given path
the measured service-times are reset periodically.

This approach has provided a 10% increase in the selection of a path
that was forcibly made to be less loaded than the alternative path.

Reported-by: Todd Gill
Signed-off-by: Mike Snitzer
---
 drivers/md/dm-mpath.c         |   6 ++-
 drivers/md/dm-path-selector.h |   5 ++-
 drivers/md/dm-queue-length.c  |   4 +-
 drivers/md/dm-service-time.c  | 101 +++---
 4 files changed, 35 insertions(+), 81 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 52baf8a..fb37dff 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -105,6 +105,7 @@ struct multipath {
 struct dm_mpath_io {
 	struct pgpath *pgpath;
 	size_t nr_bytes;
+	ktime_t io_start_time;
 };
 
 typedef int (*action_fn) (struct pgpath *pgpath);
@@ -507,7 +508,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 
 	if (pgpath->pg->ps.type->start_io)
 		pgpath->pg->ps.type->start_io(&pgpath->pg->ps, &pgpath->path,
-					      nr_bytes);
+					      nr_bytes, &mpio->io_start_time);
 	return DM_MAPIO_REMAPPED;
 }
 
@@ -1374,7 +1375,8 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (pgpath) {
 		ps = &pgpath->pg->ps;
 		if (ps->type->end_io)
-			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
+			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes,
+					 &mpio->io_start_time);
 	}
 
 	clear_request_fn_mpio(m, map_context);
diff --git a/drivers/md/dm-path-selector.h b/drivers/md/dm-path-selector.h
index b6eb536..c09d2bb 100644
--- a/drivers/md/dm-path-selector.h
+++ b/drivers/md/dm-path-selector.h
@@ -13,6 +13,7 @@
 #define	DM_PATH_SELECTOR_H
 
 #include <linux/device-mapper.h>
+#include <linux/ktime.h>
 
 #include "dm-mpath.h"
 
@@ -72,9 +73,9 @@ struct path_selector_type {
 			   status_type_t type, char *result,
 			   unsigned int maxlen);
 	int (*start_io) (struct path_selector *ps, struct dm_path *path,
-			 size_t nr_bytes);
+			 size_t nr_bytes, ktime_t *io_start_time);
 	int (*end_io) (struct path_selector *ps, struct dm_path *path,
-		       size_t nr_bytes);
+		       size_t nr_bytes, ktime_t *io_start_time);
 };
 
 /* Register a path selector */
diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c
index 23f1786..4affb39 100644
--- a/drivers/md/dm-queue-length.c
+++ b/drivers/md/dm-queue-length.c
@@ -217,7 +217,7 @@ out:
 }
 
 static int ql_start_io(struct path_selector *ps, struct dm_path *path,
-		       size_t nr_bytes)
+		       size_t nr_bytes, ktime_t *io_start_time)
 {
 	struct path_info *pi = path->pscontext;
 
@@ -227,7 +227,7 @@ static int ql_start_io(struct path_selector *ps, struct dm_path *path,
 }
 
 static int ql_end_io(struct path_selector *ps, struct dm_path *path,
-		     size_t nr_bytes)
+		     size_t nr_bytes, ktime_t *io_start_time)
 {
 	struct path_info *pi = path->pscontext;
 
diff --git a/drivers/md/dm-service-time.c b/drivers/md/dm-service-time.c
index 7b86420..1b3e45a 100644
--- a/drivers/md/dm-service-time.c
+++ b/drivers/md/dm-service-time.c
@@ -19,7 +19,7 @@
 #define ST_MAX_RELATIVE_THROUGHPUT	100
 #define ST_MAX_RELATIVE_THROUGHPUT_SHIFT	7
 #define ST_MAX_INFLIGHT_SIZE	((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
-#define ST_VERSION	"0.3.0"
+#define ST_VERSION	"0.4.0"
 
 struct selector {
 	struct list_head valid_paths;
@@ -32,7 +32,8 @@ struct path_info {
 	struct dm_path *path;
 	unsigned repeat_count;
 	unsigned relative_throughput;
-	atomic_t in_flight_size;	/* Total size of in-flight I/Os */
+	s64 service_time_usecs;
+	unsigned select_count;
 };
 
 static struct selector *alloc_selector(void)
@@ -92,7 +93,7 @@ static int st_status(struct path_selector *ps, struct dm_path *path,
 	switch (type) {
 	case STATUSTYPE_INFO:
-		DMEMIT("%d %u ", atomic_read(&pi->in_flight_size),
+		DMEMIT("%u %u ", (unsigned)pi->service_time_usecs,
Re: [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
Ladies and Gentlemen,

To show off some numbers from our testing: all tests are performed against
the cache of the array, not the disks, as we wanted to test the Linux
stack, not the disk array. All single queue tests have been performed with
the deadline I/O scheduler.

Comments welcome, have fun reading :-)

QLogic 32GBit FC HBA, NetApp EF560 Array w/ DM MQ, this patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 785136 | 3066.1MB/s |
8k  | 734983 | 5742.6MB/s |
16k | 398516 | 6226.9MB/s |
32k | 200589 | 6268.5MB/s |
64k | 100417 | 6276.2MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 788620 | 3080.6MB/s |
8k  | 736359 | 5752.9MB/s |
16k | 398597 | 6228.8MB/s |
32k | 200487 | 6265.3MB/s |
64k | 100402 | 6275.2MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242105 | 968423KB/s |
8k  | 90     | 1736.7MB/s |
16k | 178191 | 2784.3MB/s |
32k | 133619 | 4175.7MB/s |
64k | 97693  | 6105.9MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134788 | 539155KB/s |
8k  | 132361 | 1034.8MB/s |
16k | 129941 | 2030.4MB/s |
32k | 128288 | 4009.4MB/s |
64k | 97776  | 6111.0MB/s |

QLogic 32GBit FC HBA, NetApp EF560 Array w/ DM SQ, this patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 112402 | 449608KB/s |
8k  | 112818 | 902551KB/s |
16k | 111885 | 1748.3MB/s |
32k | 188015 | 5875.6MB/s |
64k | 99021  | 6188.9MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115046 | 460186KB/s |
8k  | 113974 | 911799KB/s |
16k | 113374 | 1771.5MB/s |
32k | 192932 | 6029.2MB/s |
64k | 100474 | 6279.7MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 114284 | 457138KB/s |
8k  | 113992 | 911944KB/s |
16k | 113715 | 1776.9MB/s |
32k | 130402 | 4075.9MB/s |
64k | 92243  | 5765.3MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115540 | 462162KB/s |
8k  | 114243 | 913951KB/s |
16k | 300153 | 4689.1MB/s |
32k | 141069 | 4408.5MB/s |
64k | 97620  | 6101.3MB/s |

QLogic 32GBit FC HBA, NetApp EF560 Array w/ DM MQ, previous patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 782733 | 3057.6MB/s |
8k  | 732143 | 5719.9MB/s |
16k | 398314 | 6223.7MB/s |
32k | 200538 | 6266.9MB/s |
64k | 100422 | 6276.5MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 786707 | 3073.8MB/s |
8k  | 730579 | 5707.7MB/s |
16k | 398799 | 6231.3MB/s |
32k | 200518 | 6266.2MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242426 | 969707KB/s |
8k  | 223079 | 1742.9MB/s |
16k | 177889 | 2779.6MB/s |
32k | 133637 | 4176.2MB/s |
64k | 97727  | 6107.1MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134360 | 537442KB/s |
8k  | 129738 | 1013.6MB/s |
16k | 129746 | 2027.3MB/s |
32k | 127875 | 3996.1MB/s |
64k | 97683  | 6105.3MB/s |

Emulex 16GBit FC HBA, NetApp EF560 Array w/ DM MQ, this patchset:
[Beware, this is with Hannes' lockless lpfc patches, which are not upstream
as they're quite experimental, but are good at showing the capability of
the new dm-mpath]

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 939752 | 3670.1MB/s |
8k  | 741462 | 5792.7MB/s |
16k | 399285 | 6238.9MB/s |
32k | 196490 | 6140.4MB/s |
64k | 100325 | 6270.4MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 926222 | 3618.6MB/s |
8k  | 750125 | 5860.4MB/s |
16k | 397770 | 6215.2MB/s |
32k | 200130 | 6254.8MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 251938 | 984.14MB/s |
8k  | 226712 | 1771.2MB/s |
16k | 180739 | 2824.5MB/s |
32k | 133316 | 4166.2MB/s |
64k | 98738  | 6171.2MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134660 | 538643KB/s |
8k  | 131585 | 1028.9MB/s |
16k | 131030 | 2047.4MB/s |
32k | 126987 | 3968.4MB/s |
64k | 98882  | 6180.2MB/s |

Emulex 16GBit FC HBA, NetApp EF560 Array w/ DM SQ, this patchset:

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 101860 |
Re: [dm-devel] [PATCH v2 10/18] multipathd: delay reloads during creation
On 8.4.2016 at 01:20, Benjamin Marzinski wrote:
> lvm needs PV devices to not be suspended while the udev rules are running,
> for them to be correctly identified as PVs. However, multipathd will often
> be in a situation where it will create a multipath device upon seeing a
> path, and then immediately reload the device upon seeing another path. If
> multipath is reloading a device while processing the udev event from its
> creation, lvm can fail to identify it as a PV. This can cause systems to
> fail to boot.
>
> Unfortunately, using udev synchronization cookies to solve this issue
> would cause a host of other issues that could only be avoided by a pretty
> substantial change in how multipathd does locking and event processing.

This is somewhat a misunderstanding of the core problem - it's not about
'lvm2 needs devices to not be suspended'. So let me elaborate here with
more details. (And Peter, please fix me in case of mistakes ;))

An lvm2 command has lvm.conf settings allowing a user to select whether to
scan devices that are suspended or not - there is already a 'race' here,
since checking that a device is not suspended and then opening it leaves
lots of room for the race, but that's not a major issue in this case.

The reasons to skip reading suspended devices are primarily to avoid
holding the VG lock for potentially a very long time, and also to avoid
udev, with its 'built-in' 30sec timeout, killing a worker process blocked
in a blkid scan, with the danger of marking the device as GONE and further
consequences like an automated umount by systemd.

Thus the lvm2/dm udev rules implemented a (racy) check for skipping a read
of a suspended device, and this check may also skip the call of pvscan,
with the generic assumption that a RESUME follows with a CHANGE event and
the device will be properly scanned anyway - so no info would be lost, it
would only get into the udev database later.
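The racy "skip suspended devices" check described above has roughly this
shape in udev-rule terms. This is only a schematic, not the literal lvm2/dm
rules (the label name is illustrative); DM_SUSPENDED and IMPORT{builtin}
are real udev-rule constructs:

```
# Schematic of the racy skip: DM_SUSPENDED reflects the device state at
# rule-execution time, so the device can be suspended right after this
# test fires - the race described above.
ENV{DM_SUSPENDED}=="1", GOTO="skip_scan"
IMPORT{builtin}="blkid"
LABEL="skip_scan"
```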
However now multipathd *kills* this assumption - since the current udev
rules implementation for multipath devices targets only the initial scan,
and all subsequent RESUMEs are supposed to be ignored, as it's believed the
device remained in the same state and only some paths have been added/lost.
Scanning such a device thus shall not change any remembered info in the
udev database. As an 'extra' bonus, multipath may SUSPEND devices (and
that's somehow solved by this patch) right after the initial activation of
the device, so the lvm2 check for skipping suspended devices may have
skipped the whole discovery operation, and since further RESUMEs were meant
to be ignored, the device remained invisible forever.

Now we get to the best technical solution for multipath here with the
other surrounding software (i.e. udev) - before multipathd starts to mark
RESUMEs as 'ignorable', it should check/validate whether the udev db
already contains valid information about the device (i.e. it's been
scanned...) and only in that case mark the device to be ignored.

This may of course mean a few more extra initial repeated scans - but it's
the only 'safest' way to proceed (i.e. you can't resolve the problem with
cookie waiting on a loaded system, since the udev 30sec timeout is
unpredictable).

Regards

Zdenek