I'd like to resurface an earlier proposal [1] to make the unplug timeout
user-configurable. With production guests running sustained, non-trivial
workloads, we find that hot(un)plug operations regularly take longer than the
current default of 5s. Because the timeout cannot easily be configured, in the
worst case each deployment needs a recompilation with appropriate values set,
which is cumbersome to maintain and patch.

The previous proposal [1] also initially made the timeout configurable, but it
was aimed specifically at the ppc64 architecture. Imposing a single change for
that entire architecture therefore made sense, since it was a universal
requirement for the platform. With deployment environments as diverse and
varied as those on x86, a single, non-configurable value does not make as much
sense now.

For x86, such a blanket change is not ideal since the desired timeout value
depends very heavily on the deployment environment in use. If the unplug is
being performed by a human, anything more than 10s is an undesirable user
experience [2], as pointed out in the original discussion. However, more
relaxed timeout values (e.g. 15-20s) can be acceptable if the operations run in
the background or are not directly user-facing. It is in these cases that the
ability to change the timeout easily becomes most desirable.

To prevent misconfiguration, while maintaining the same out-of-box experience
for existing users, I'd also like to propose the following:

1. As mentioned in the original proposal [1], the timeout cannot be set to a 
value lower than the default timeout of 5s; in these cases a warning
will be emitted and the timeout will be reset to the default value.
2. Including an upper cap would also be desirable, in my opinion: no API
consumer should wait endlessly for an async operation to finish.
3. This configuration option will only be described in qemu.conf, not imposed:
the out-of-box experience will remain unchanged at the current default (5s) for
everyone who does not wish to tweak, or is unaware of, this value. An
administrator who knowingly changes the value is therefore not blindsided by
any perceived "hangs" resulting from the use of an increased timeout window.
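
To make points 1 and 2 concrete, here is a rough sketch of the validation I
have in mind. It is not taken from the libvirt tree: the function and macro
names are hypothetical, the 20s cap is only an illustrative value, and
clamping an over-large value down to the cap (rather than rejecting it) is an
assumption on my part.

```c
#include <stdio.h>

/* Hypothetical names, for illustration only; not existing libvirt symbols. */
#define UNPLUG_TIMEOUT_DEFAULT_MS 5000U   /* current built-in default (5s) */
#define UNPLUG_TIMEOUT_MAX_MS     20000U  /* assumed upper cap, value is a placeholder */

/*
 * Return the timeout that would actually be used for a configured value.
 * Values below the default are reset to the default with a warning (point 1);
 * values above the cap are clamped to the cap with a warning (point 2).
 */
static unsigned int
unplug_timeout_sanitize(unsigned int configured_ms)
{
    if (configured_ms < UNPLUG_TIMEOUT_DEFAULT_MS) {
        fprintf(stderr,
                "warning: unplug timeout %u ms is below the default, using %u ms\n",
                configured_ms, UNPLUG_TIMEOUT_DEFAULT_MS);
        return UNPLUG_TIMEOUT_DEFAULT_MS;
    }

    if (configured_ms > UNPLUG_TIMEOUT_MAX_MS) {
        fprintf(stderr,
                "warning: unplug timeout %u ms exceeds the cap, using %u ms\n",
                configured_ms, UNPLUG_TIMEOUT_MAX_MS);
        return UNPLUG_TIMEOUT_MAX_MS;
    }

    return configured_ms;
}
```

With this shape, a setting left unset or set too low behaves exactly as today,
and only values between the default and the cap change the wait time.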

Since this would be a clearly documented and explicitly opt-in behaviour, the 
benefits in the form of easy modification based on evolving operational needs 
would be quite significant. Given the varying nature of production environments 
and specific operational needs at scale, it would therefore be beneficial to 
reconsider the merits of the original proposal.

[1] https://lists.libvirt.org/archives/list/[email protected]/thread/TXYVZ4BZW5GN3XS3YFJ3ODVVV5NPBTPY/
[2] https://www.nngroup.com/articles/response-times-3-important-limits/
