I'd like to resurface an earlier proposal [1] to make the unplug timeout user-configurable. With production guests running sustained, non-trivial workloads, we find that hot(un)plug operations regularly take longer than the current default of 5s. Because this timeout cannot easily be configured, in the worst case each deployment requires a recompilation with appropriate values set, which is cumbersome to maintain and patch.
The previous proposal [1] also initially made the timeout configurable, but it specifically targeted the ppc64 architecture. Imposing the change for an entire architecture made sense there, since it was a universal requirement for that platform. With deployment environments as diverse and varied as those on x86, a single, non-configurable value makes less sense now.

For x86, such a blanket change is not ideal, since the desired timeout value depends very heavily on the deployment environment in use. If the operation is performed interactively by a human, anything more than 10s is an undesirable user experience [2], as pointed out in the original discussion. However, more relaxed timeout values (e.g. 15-20s) can be acceptable if the operations run in the background and are not directly user-facing. It is in these cases that the ability to change the timeout easily becomes most desirable.

To prevent misconfiguration while maintaining the same out-of-box experience for existing users, I'd also like to propose the following:

1. As mentioned in the original proposal [1], the timeout cannot be set to a value lower than the default timeout of 5s; in these cases a warning will be emitted and the timeout will be reset to the default value.

2. Including an upper cap would also be desirable, in my opinion: no API consumer should wait endlessly for an async operation to finish.

3. This configuration option will only be described in qemu.conf, not imposed. The out-of-box experience will remain unchanged at the current default (5s) for everyone who does not wish to tweak this value or is unaware of it.

An administrator who knowingly changes this value is therefore not blindsided by any perceived "hangs" resulting from an increased timeout window. Since this would be clearly documented and explicitly opt-in behaviour, the benefit of easy modification based on evolving operational needs would be quite significant.
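To make points 1 and 2 above concrete, here is a minimal sketch of the validation I have in mind. It is not taken from the libvirt tree: the function name, the macro names and the 60s cap are all hypothetical, and clamping to the cap (rather than rejecting the value) is just one possible choice. Only the 5s floor and the warn-and-reset behaviour come from the proposal itself.

```c
#include <stdio.h>

/* Current default from the proposal: 5 seconds. */
#define UNPLUG_TIMEOUT_DEFAULT_SEC 5U

/* Hypothetical upper cap; the proposal does not fix a value. */
#define UNPLUG_TIMEOUT_MAX_SEC 60U

/*
 * Hypothetical helper: validate an explicitly configured unplug
 * timeout (in seconds) and return the value that would be used.
 *
 * - Below the default: warn and reset to the default (point 1).
 * - Above the cap: warn and clamp to the cap (point 2; clamping is
 *   an assumption, rejecting the value would also be reasonable).
 */
static unsigned int
unplug_timeout_validate(unsigned int configured_sec)
{
    if (configured_sec < UNPLUG_TIMEOUT_DEFAULT_SEC) {
        fprintf(stderr,
                "warning: unplug timeout %us is below the minimum of %us, "
                "using the default\n",
                configured_sec, UNPLUG_TIMEOUT_DEFAULT_SEC);
        return UNPLUG_TIMEOUT_DEFAULT_SEC;
    }

    if (configured_sec > UNPLUG_TIMEOUT_MAX_SEC) {
        fprintf(stderr,
                "warning: unplug timeout %us exceeds the cap of %us, "
                "using the cap\n",
                configured_sec, UNPLUG_TIMEOUT_MAX_SEC);
        return UNPLUG_TIMEOUT_MAX_SEC;
    }

    return configured_sec;
}
```

The helper would only be called when the option is explicitly set; if it is absent from the config, the default of 5s applies without any warning, so existing users see no change.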
Given the varying nature of production environments and specific operational needs at scale, it would therefore be beneficial to reconsider the merits of the original proposal.

[1] https://lists.libvirt.org/archives/list/[email protected]/thread/TXYVZ4BZW5GN3XS3YFJ3ODVVV5NPBTPY/
[2] https://www.nngroup.com/articles/response-times-3-important-limits/
