Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
On 8/30/19 8:15 AM, Christophe de Dinechin wrote: Daniel P. Berrangé writes: The reason for this timeout is that we originally promised something that we cannot deliver - a synchronous device detach API, while the operation itself is asynchronous. I'm not a fan of exposing it and making it configurable. I'm especially *not* a fan because the commit messages says this is a problem on certain architectures. Since we know what those arches are, we should use a larger timeout for those arches out of the box. Requiring admin to set a config param to fix the architectures is super unpleasant out of the box experiance. True, but also notice that 5 seconds is also already close to the attention span time limit for users [1]. So increasing it to 10s might bring people to believe things are stuck, unless you provide some sort of feedback that this is normal. https://www.nngroup.com/articles/response-times-3-important-limits/ Interesting link, thanks. About the user feedback due to long response delay: we're already breaking this with the setvcpus command, at least with PowerPC guests and a lot of vcpus being unplugged. Here's an example in which I am able to complete the command without kicking the timeout error (guest is idle, vcpu unplug is fast in this case): --- guest booted with 1 vcpu - added 39 extra vcpus. Operation takes a second, it's fast --- [danielhb@kop5 libvirt]$ sudo ./run tools/virsh setvcpus vcpus-test 40 --live [danielhb@kop5 libvirt]$ [danielhb@kop5 libvirt]$ --- removing them back [danielhb@kop5 libvirt]$ time sudo ./run tools/virsh setvcpus vcpus-test 1 --live real 0m21.695s user 0m0.119s sys 0m0.000s [danielhb@kop5 libvirt]$ This happens in PowerPC because the timeout is being considered not for the whole operation, but per device. Since I'm unplugging 39 devices and the 5 seconds timeout is refreshed for every operation, in theory the user can wait close to 39*5 seconds with the terminal frozen. Now, if we are to adhere to such UX standards (IMO, we should), I propose the following: - short term: increase PowerPC timeout to 10 seconds per device. Following the UX guideline above, this is the limit we can go without warning the user about the delay; - short term: for PowerPC guests, tune 'setvcpus' message to warn the user that the operation can take some time to complete; These 2 are simples changes and I can get it done for the next release without too much trouble. - mid/long term: I can look into the PowerPC guest implementation, see if there are device_del events being fired up in QMP and implement a better UX with more information about how the process is going. Something like "vcpu 1 out of 30 unplugged", "vcpu 2 out of 30 unplugged", or a progress bar, or whatever makes more sense to give the user a feeling of operation ongoing. Note that I'm suggesting PowerPC only changes due to what Daniel said earlier - we can't impact other users due to something that, at first glance, only PowerPC does different. I have a hunch that we should do for all archs, but I can't defend this claim without testing this in x86 at least. These short-term changes are easy to make it across the board though,so it's just a matter of removing "if PowerPC" in these changes. What do you think? Thanks, DHB Regards, Daniel -- |:https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |:https://libvirt.org -o-https://fstop138.berrange.com :| |:https://entangle-photo.org -o-https://www.instagram.com/dberrange :| -- Cheers, Christophe de Dinechin (IRC c3d) -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
Daniel P. Berrangé writes: >> > >> >> The reason for this timeout is that we originally promised something >> that we cannot deliver - a synchronous device detach API, while the >> operation itself is asynchronous. I'm not a fan of exposing it and >> making it configurable. > > I'm especially *not* a fan because the commit messages says this is > a problem on certain architectures. Since we know what those arches > are, we should use a larger timeout for those arches out of the box. > Requiring admin to set a config param to fix the architectures is > super unpleasant out of the box experiance. True, but also notice that 5 seconds is also already close to the attention span time limit for users [1]. So increasing it to 10s might bring people to believe things are stuck, unless you provide some sort of feedback that this is normal. https://www.nngroup.com/articles/response-times-3-important-limits/ > > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- Cheers, Christophe de Dinechin (IRC c3d) -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
On 8/30/19 5:24 AM, Daniel P. Berrangé wrote: On Mon, Aug 19, 2019 at 11:16:24AM +0200, Ján Tomko wrote: On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza wrote: For some architectures and setups, device removal can take longer than the default 5 seconds. This results in commands such as 'virsh setvcpus' to fire timeout messages even if the actual operation happened in the guest, causing confusion for the user. The commit that introduced this error message: commit e3229f6e4461cd1721dc68a32e16ab1718ae716e qemu: hotplug: Add support for VCPU unplug specifically says that we treat this differently than regular device detach: As the new code is using device_del all the implications of using it are present. Contrary to the device deletion code, the vcpu deletion code fails if the unplug request is not executed in time. Technically, we already did execute the unplug request so we lie to the user saying "operation failed". Maybe we can revisit the decision? [cc-ing pkrempa who added this] This patch adds a new qemu.conf parameter called 'unplug_timeout' to handle these cases. If left unset, the current default timeout is used. To avoid user 'experimentation' with small timeouts, the current timeout is also the minimal value allowed. The reason for this timeout is that we originally promised something that we cannot deliver - a synchronous device detach API, while the operation itself is asynchronous. I'm not a fan of exposing it and making it configurable. I'm especially *not* a fan because the commit messages says this is a problem on certain architectures. Since we know what those arches are, we should use a larger timeout for those arches out of the box. Requiring admin to set a config param to fix the architectures is super unpleasant out of the box experiance. Good point. I'll re-send the series changing the timeout for PowerPC guests only. There's no need to impact all users for a problem that so far only impacts PPC. Thanks, DHB Regards, Daniel -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
On Mon, Aug 19, 2019 at 11:16:24AM +0200, Ján Tomko wrote: > On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza wrote: > > For some architectures and setups, device removal can take > > longer than the default 5 seconds. This results in commands > > such as 'virsh setvcpus' to fire timeout messages even if > > the actual operation happened in the guest, causing confusion > > for the user. > > > > The commit that introduced this error message: >commit e3229f6e4461cd1721dc68a32e16ab1718ae716e >qemu: hotplug: Add support for VCPU unplug > > specifically says that we treat this differently than regular device > detach: > >As the new code is using device_del all the implications of using it >are present. Contrary to the device deletion code, the vcpu deletion >code fails if the unplug request is not executed in time. > > Technically, we already did execute the unplug request so we lie to the > user saying "operation failed". > > Maybe we can revisit the decision? [cc-ing pkrempa who added this] > > > This patch adds a new qemu.conf parameter called 'unplug_timeout' > > to handle these cases. If left unset, the current default > > timeout is used. To avoid user 'experimentation' with small > > timeouts, the current timeout is also the minimal value > > allowed. > > > > The reason for this timeout is that we originally promised something > that we cannot deliver - a synchronous device detach API, while the > operation itself is asynchronous. I'm not a fan of exposing it and > making it configurable. I'm especially *not* a fan because the commit messages says this is a problem on certain architectures. Since we know what those arches are, we should use a larger timeout for those arches out of the box. Requiring admin to set a config param to fix the architectures is super unpleasant out of the box experiance. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
Bump Does anyone else want to weight in on this? Jano suggested that exposing the timeout configuration is not the best plan, and I don't have a strong opinion about it. I even mentioned below that I'm ok with the idea of simply increasing the timeout to 15 seconds and changing the error message a bit to let users know that this is not an error per-se and the user should see the guest to see the result. If no one else have feelings for this matter, I'll simply re-sent the series with the change I just mentioned. I might also keep the unplug settings in the config object to get rid of qemu_hotplugpriv.h (patch 3/3). Thanks, DHB On 8/19/19 7:07 AM, Daniel Henrique Barboza wrote: Hi, On 8/19/19 6:16 AM, Ján Tomko wrote: On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza wrote: For some architectures and setups, device removal can take longer than the default 5 seconds. This results in commands such as 'virsh setvcpus' to fire timeout messages even if the actual operation happened in the guest, causing confusion for the user. The commit that introduced this error message: commit e3229f6e4461cd1721dc68a32e16ab1718ae716e qemu: hotplug: Add support for VCPU unplug specifically says that we treat this differently than regular device detach: As the new code is using device_del all the implications of using it are present. Contrary to the device deletion code, the vcpu deletion code fails if the unplug request is not executed in time. Technically, we already did execute the unplug request so we lie to the user saying "operation failed". Maybe we can revisit the decision? [cc-ing pkrempa who added this] I have thought about making setvcpus asynchronous when it is a device_del operation. This would be more code (perhaps a new command to do that? Or a --asynchronous option?), and we can set user expectations properly by making it clear that this is a unplug request command, and the user will need to check the result personally in the guest. But then, if we are prepared to tell the user "go check yourself" we can simply change the current timeout message to say "vcpu unplug [...] timeout, check unplug status in the guest". This would be clearer, and the user wouldn't automatically assume that timeout == operation failed. Another thing we can do, instead of exposing the option to the user (which has a good potential for disaster - hence why I added a minimal value), is to simply set the timeout to a greater value (10? 15?) and be done with it. If we set the timeout to 15 seconds and change the timeout message to let the user know "we don't know, you'll need to check", like I said above, we have more resilience and will not alarm the user if a timeout still occurs. We'll also avoid exposing the timeout configuration like I'm doing here. This patch adds a new qemu.conf parameter called 'unplug_timeout' to handle these cases. If left unset, the current default timeout is used. To avoid user 'experimentation' with small timeouts, the current timeout is also the minimal value allowed. The reason for this timeout is that we originally promised something that we cannot deliver - a synchronous device detach API, while the operation itself is asynchronous. I'm not a fan of exposing it and making it configurable. Signed-off-by: Daniel Henrique Barboza --- src/qemu/libvirtd_qemu.aug | 3 +++ src/qemu/qemu.conf | 4 src/qemu/qemu_conf.c | 26 ++ src/qemu/qemu_conf.h | 5 + src/qemu/qemu_driver.c | 2 ++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 6 files changed, 41 insertions(+) [...] diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 0cbddd7a9c..29824e4e35 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -214,6 +214,8 @@ struct _virQEMUDriverConfig { gid_t swtpm_group; char **capabilityfilters; + + unsigned int unplugTimeout; }; /* Main driver state */ @@ -294,6 +296,9 @@ struct _virQEMUDriver { /* Immutable pointer, self-locking APIs */ virHashAtomicPtr migrationErrors; + + /* Immutable value */ + unsigned int unplugTimeout; }; Why store this value twice? I wanted the value to be available at the driver object, but saw that the parsing of the reading file put stuff in config. However, just realized that we get to the cfg via qemu_driver->config (as long as the lock is being held, which I think it's the case for all unplug operations). In case we still want this configurable timeout solution, I'll fix this in the next spin. Thanks, DHB Jano -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
Hi, On 8/19/19 6:16 AM, Ján Tomko wrote: On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza wrote: For some architectures and setups, device removal can take longer than the default 5 seconds. This results in commands such as 'virsh setvcpus' to fire timeout messages even if the actual operation happened in the guest, causing confusion for the user. The commit that introduced this error message: commit e3229f6e4461cd1721dc68a32e16ab1718ae716e qemu: hotplug: Add support for VCPU unplug specifically says that we treat this differently than regular device detach: As the new code is using device_del all the implications of using it are present. Contrary to the device deletion code, the vcpu deletion code fails if the unplug request is not executed in time. Technically, we already did execute the unplug request so we lie to the user saying "operation failed". Maybe we can revisit the decision? [cc-ing pkrempa who added this] I have thought about making setvcpus asynchronous when it is a device_del operation. This would be more code (perhaps a new command to do that? Or a --asynchronous option?), and we can set user expectations properly by making it clear that this is a unplug request command, and the user will need to check the result personally in the guest. But then, if we are prepared to tell the user "go check yourself" we can simply change the current timeout message to say "vcpu unplug [...] timeout, check unplug status in the guest". This would be clearer, and the user wouldn't automatically assume that timeout == operation failed. Another thing we can do, instead of exposing the option to the user (which has a good potential for disaster - hence why I added a minimal value), is to simply set the timeout to a greater value (10? 15?) and be done with it. If we set the timeout to 15 seconds and change the timeout message to let the user know "we don't know, you'll need to check", like I said above, we have more resilience and will not alarm the user if a timeout still occurs. We'll also avoid exposing the timeout configuration like I'm doing here. This patch adds a new qemu.conf parameter called 'unplug_timeout' to handle these cases. If left unset, the current default timeout is used. To avoid user 'experimentation' with small timeouts, the current timeout is also the minimal value allowed. The reason for this timeout is that we originally promised something that we cannot deliver - a synchronous device detach API, while the operation itself is asynchronous. I'm not a fan of exposing it and making it configurable. Signed-off-by: Daniel Henrique Barboza --- src/qemu/libvirtd_qemu.aug | 3 +++ src/qemu/qemu.conf | 4 src/qemu/qemu_conf.c | 26 ++ src/qemu/qemu_conf.h | 5 + src/qemu/qemu_driver.c | 2 ++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 6 files changed, 41 insertions(+) [...] diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 0cbddd7a9c..29824e4e35 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -214,6 +214,8 @@ struct _virQEMUDriverConfig { gid_t swtpm_group; char **capabilityfilters; + + unsigned int unplugTimeout; }; /* Main driver state */ @@ -294,6 +296,9 @@ struct _virQEMUDriver { /* Immutable pointer, self-locking APIs */ virHashAtomicPtr migrationErrors; + + /* Immutable value */ + unsigned int unplugTimeout; }; Why store this value twice? I wanted the value to be available at the driver object, but saw that the parsing of the reading file put stuff in config. However, just realized that we get to the cfg via qemu_driver->config (as long as the lock is being held, which I think it's the case for all unplug operations). In case we still want this configurable timeout solution, I'll fix this in the next spin. Thanks, DHB Jano -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza wrote: For some architectures and setups, device removal can take longer than the default 5 seconds. This results in commands such as 'virsh setvcpus' to fire timeout messages even if the actual operation happened in the guest, causing confusion for the user. The commit that introduced this error message: commit e3229f6e4461cd1721dc68a32e16ab1718ae716e qemu: hotplug: Add support for VCPU unplug specifically says that we treat this differently than regular device detach: As the new code is using device_del all the implications of using it are present. Contrary to the device deletion code, the vcpu deletion code fails if the unplug request is not executed in time. Technically, we already did execute the unplug request so we lie to the user saying "operation failed". Maybe we can revisit the decision? [cc-ing pkrempa who added this] This patch adds a new qemu.conf parameter called 'unplug_timeout' to handle these cases. If left unset, the current default timeout is used. To avoid user 'experimentation' with small timeouts, the current timeout is also the minimal value allowed. The reason for this timeout is that we originally promised something that we cannot deliver - a synchronous device detach API, while the operation itself is asynchronous. I'm not a fan of exposing it and making it configurable. Signed-off-by: Daniel Henrique Barboza --- src/qemu/libvirtd_qemu.aug | 3 +++ src/qemu/qemu.conf | 4 src/qemu/qemu_conf.c | 26 ++ src/qemu/qemu_conf.h | 5 + src/qemu/qemu_driver.c | 2 ++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 6 files changed, 41 insertions(+) [...] diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 0cbddd7a9c..29824e4e35 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -214,6 +214,8 @@ struct _virQEMUDriverConfig { gid_t swtpm_group; char **capabilityfilters; + +unsigned int unplugTimeout; }; /* Main driver state */ @@ -294,6 +296,9 @@ struct _virQEMUDriver { /* Immutable pointer, self-locking APIs */ virHashAtomicPtr migrationErrors; + +/* Immutable value */ +unsigned int unplugTimeout; }; Why store this value twice? Jano signature.asc Description: PGP signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v1 1/3] adding unplug_timeout QEMU conf
For some architectures and setups, device removal can take longer than the default 5 seconds. This results in commands such as 'virsh setvcpus' to fire timeout messages even if the actual operation happened in the guest, causing confusion for the user. This patch adds a new qemu.conf parameter called 'unplug_timeout' to handle these cases. If left unset, the current default timeout is used. To avoid user 'experimentation' with small timeouts, the current timeout is also the minimal value allowed. Signed-off-by: Daniel Henrique Barboza --- src/qemu/libvirtd_qemu.aug | 3 +++ src/qemu/qemu.conf | 4 src/qemu/qemu_conf.c | 26 ++ src/qemu/qemu_conf.h | 5 + src/qemu/qemu_driver.c | 2 ++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 6 files changed, 41 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 2a99a0c55f..3bf94c9235 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -130,6 +130,8 @@ module Libvirtd_qemu = let capability_filters_entry = str_array_entry "capability_filters" + let unplug_timeout_entry = int_entry "unplug_timeout" + (* Each entry in the config is one of the following ... *) let entry = default_tls_entry | vnc_entry @@ -152,6 +154,7 @@ module Libvirtd_qemu = | nbd_entry | swtpm_entry | capability_filters_entry + | unplug_timeout_entry let comment = [ label "#comment" . del /#[ \t]*/ "# " . store /([^ \t\n][^\n]*)?/ . del /\n/ "\n" ] let empty = [ label "#empty" . eol ] diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf index 8cabeccacb..c6d0f0940c 100644 --- a/src/qemu/qemu.conf +++ b/src/qemu/qemu.conf @@ -847,3 +847,7 @@ # may change across versions. # #capability_filters = [ "capname" ] + +# Timeout, in seconds, for unplug operations. Default and minimal value +# is 5. +#unplug_timeout = 5 diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 2953893337..83d4ac8310 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -69,6 +69,8 @@ VIR_LOG_INIT("qemu.qemu_conf"); #define QEMU_MIGRATION_PORT_MIN 49152 #define QEMU_MIGRATION_PORT_MAX 49215 +#define QEMU_UNPLUG_TIMEOUT 5 + static virClassPtr virQEMUDriverConfigClass; static void virQEMUDriverConfigDispose(void *obj); @@ -298,6 +300,8 @@ virQEMUDriverConfigPtr virQEMUDriverConfigNew(bool privileged) cfg->glusterDebugLevel = 4; cfg->stdioLogD = true; +cfg->unplugTimeout = QEMU_UNPLUG_TIMEOUT; + if (!(cfg->namespaces = virBitmapNew(QEMU_DOMAIN_NS_LAST))) goto error; @@ -1009,6 +1013,24 @@ virQEMUDriverConfigLoadCapsFiltersEntry(virQEMUDriverConfigPtr cfg, } +static int +virQEMUDriverConfigLoadUnplugTimeoutEntry(virQEMUDriverConfigPtr cfg, + virConfPtr conf) +{ +if (virConfGetValueUInt(conf, "unplug_timeout", &cfg->unplugTimeout) < 0) +return -1; + +if (cfg->unplugTimeout < QEMU_UNPLUG_TIMEOUT) { +virReportError(VIR_ERR_INTERNAL_ERROR, + _("unplug_timeout: value must be greater " + "than or equal to %d"), QEMU_UNPLUG_TIMEOUT); +return -1; +} + +return 0; +} + + int virQEMUDriverConfigLoadFile(virQEMUDriverConfigPtr cfg, const char *filename, bool privileged) @@ -1081,6 +1103,10 @@ int virQEMUDriverConfigLoadFile(virQEMUDriverConfigPtr cfg, if (virQEMUDriverConfigLoadCapsFiltersEntry(cfg, conf) < 0) goto cleanup; +if (virQEMUDriverConfigLoadUnplugTimeoutEntry(cfg, conf) < 0) +goto cleanup; + + ret = 0; cleanup: diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 0cbddd7a9c..29824e4e35 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -214,6 +214,8 @@ struct _virQEMUDriverConfig { gid_t swtpm_group; char **capabilityfilters; + +unsigned int unplugTimeout; }; /* Main driver state */ @@ -294,6 +296,9 @@ struct _virQEMUDriver { /* Immutable pointer, self-locking APIs */ virHashAtomicPtr migrationErrors; + +/* Immutable value */ +unsigned int unplugTimeout; }; virQEMUDriverConfigPtr virQEMUDriverConfigNew(bool privileged); diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 11f97dbc65..5e86d4cfe6 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -1078,6 +1078,8 @@ qemuStateInitialize(bool privileged, if (!qemu_driver->workerPool) goto error; +qemu_driver->unplugTimeout = cfg->unplugTimeout * 1000ul; + qemuProcessReconnectAll(qemu_driver); qemuAutostartDomains(qemu_driver); diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 9f81759cb5..a1cd48f1ee 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirt