Re: [libvirt] [PATCH] nwfilter: intantiate filters on loading driver
On 11/1/18 3:26 AM, Nikolay Shirokovskiy wrote: > > > On 01.11.2018 03:48, John Ferlan wrote: >> >> >> On 10/31/18 10:41 AM, Nikolay Shirokovskiy wrote: >>> On 31.10.2018 16:14, John Ferlan wrote: On 10/30/18 3:24 AM, Nikolay Shirokovskiy wrote: > > > On 29.10.2018 22:37, John Ferlan wrote: >> >> >> On 10/15/18 4:26 AM, Nikolay Shirokovskiy wrote: >>> Before using filters binding filters instantiation was done by >>> hypervisors >>> drivers initialization code (qemu was the only such hypervisor). Now >>> qemu >>> reconnection done supposes it should be done by nwfilter driver >>> probably. >>> Let's add this missing step. >>> >>> Signed-off-by: Nikolay Shirokovskiy >>> --- >>> src/nwfilter/nwfilter_driver.c | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >> >> If there's research you've done where the instantiation was done before >> introduction of the nwfilter bindings that would be really helpful... >> >> I found that virNWFilterBuildAll was introduced in commit 3df907bfff. >> There were 2 callers for it: >> >>1. virNWFilterTriggerRebuildImpl >>2. nwfilterStateReload >> >> The former called as part of the virNWFilterConfLayerInit callback >> during nwfilterStateInitialize (about 50 lines earlier). First off let me say you certainly find a lot of interesting bugs/issues that are complex and that's good. Often times I wonder how you trip across things or how to quantify what you've found. Some are simpler than others. This one on the surface would seem to be simple, but I keep wondering is there a downside. >>> >>> The issue is related to my recent posts: >>> >>> [1] [RFC] Faster libvirtd restart with nwfilter rules >>> https://www.redhat.com/archives/libvir-list/2018-September/msg01206.html >>> which continues here: >>> https://www.redhat.com/archives/libvir-list/2018-October/msg00657.html >>> >>> In short if there is lots of VMs with filters then libvirtd restart takes a >>> lot of time >>> during which libvirtd is unresponsive. By the way the issue is found in >>> libvirt >>> versions 3.9.0 and ealier (don't know of newer versions, Virtuozzo is based >>> on 3.9.0 now, >>> just like Centos7 :) ) >> >> So many different issues - trying to focus on just the one for this >> particular patch. >> >>> >>> And attempts to fix it: >>> >>> [2] [PATCH RFC 0/4] nwfilter: don't reinstantiate filters if they are not >>> changed >>> https://www.redhat.com/archives/libvir-list/2018-October/msg00904.html >>> >>> [3] [PATCH v2 0/2] nwfilter: don't reinstantiate rules if there is no need >>> to >>> https://www.redhat.com/archives/libvir-list/2018-October/msg01317.html >>> >>> And the reason I started v2 is Laine mentioned that it is important to >>> reinstantiate rules on restart (v1 do not care if somebody messed tables). >>> >> >> I've seen the patches and even read some briefly, but still need to take >> this particular patch as separate from those. >> >>> And I discovered that upstream version stops reinstantiating rules at all >>> after introducing filters bindings. And this functionality is old: >> >> So the key is "when" did that happen? That is can you pinpoint or >> bisect a time in history where the filters were instantiated? And by >> instantiated what exactly (call) are you referencing? > > The below commit is the one. It adds instantiating filters on > libvirtd restart. By instantiating I mean that firewall rules > correspondent to filters are actually get [re]added to iptables etc. > >> >>> >>> commit cf6f8b9a9720fe5323a84e690de9fbf8ba41f6ac >>> Author: Stefan Berger >>> Date: Mon Aug 16 12:59:54 2010 -0400 >>> >>> nwfilter: extend nwfilter reload support >>> >> >> Yes, nwfilter is old, brittle, and in need of attention. Not sure anyone >> really wants to take on the task though! I just realized I never got the >> common object code implemented there. Mostly because of lock issues. >> Suffice to say there's more "interesting" changes in the nwfilter space >> being discussed internally. >> The nwfilter processing is kindly said rather strange, complex, and to a degree fragile. Thus when patches come along they are met with greater scrutiny. From just reading the commit message here I don't get a sense for the problem and why the solution fixes it. So I'm left to try and research myself and ask a lot of questions. BTW, some of the detail provided in this response is really useful as either part of a cover or after the --- in the single patch. I would think you'd "know" when you've done lots of research into a problem that providing reviews more than a mere hint would be useful! For nwfilter bindings, it's hard to imagine this is one of those - it just happened type events. There seems to be a very specific sequence that's missing from th
Re: [libvirt] [PATCH] nodedev: Document the udevEventHandleThread
On 11/1/18 7:36 AM, Erik Skultety wrote: > On Fri, Oct 19, 2018 at 08:04:43AM -0400, John Ferlan wrote: >> Add details of the algorithm and why it was used to help future >> readers understand the issues encountered. Based largely off a >> description provided by Erik Skultety . > > Since the link to the archive would be missing from the commit message, I feel > like simply blaming commit cdbe13329a7 for lacking documentation/justification > of the algorithm used is the right way of composing the commit message. > OK - thanks for the commit link... > I appreciate the effort to document it properly, although I feel it's a bit > overwhelming, so the comments I have below is just a sheer attempt to reduce > the amount of text documenting a single function, you know, if a function > needs > a comment like that, it suggests that there's something wrong with that > function... > Understood - I threw a lot of spaghetti on the wall and altering what's here is perfectly fine. I didn't want to leave something out that you may have found important... BTW: Another explanation for a lengthy function header - it may suggest that if you're going to post a patch changing it's logic, you had better know what you're doing ;-) >> >> Signed-off-by: John Ferlan >> --- >> >> see: https://www.redhat.com/archives/libvir-list/2018-October/msg00915.html >> >> src/node_device/node_device_udev.c | 49 ++ >> 1 file changed, 49 insertions(+) >> >> diff --git a/src/node_device/node_device_udev.c >> b/src/node_device/node_device_udev.c >> index 22897591de..8ca81e65ad 100644 >> --- a/src/node_device/node_device_udev.c >> +++ b/src/node_device/node_device_udev.c >> @@ -1591,6 +1591,53 @@ udevEventMonitorSanityCheck(udevEventDataPtr priv, >> } >> >> >> +/** >> + * udevEventHandleThread >> + * @opaque: unused >> + * >> + * Thread to handle the udevEventHandleCallback processing when udev >> + * tells us there's a device change for us (add, modify, delete, etc). >> + * >> + * Management of the processing udev device notification is a delicate >> + * balance between the udev process, the scheduler, and when exactly the >> + * data from/for the socket is received. > > I'd drop ^this sentence, the issue with this unfortunate mix of factors is > explained further down, let's try keeping the amount of commentary down. > I'll move it to the commit message ;-) >> + * The udev processing code notifies >> + * the udevEventHandleCallback that it has data; however, the actual recv() >> + * done in the udev code to fill in the @device structure can be delayed >> + * due to how threads are scheduled. > > How about: > "When we get notified by udev that there are data to be processed, the actual > data retrieval by libudev may be delayed due to how threads are scheduled." > > + * >> + * As it turns out, the event loop could be scheduled (and it in fact >> + * was) earlier than the handler thread. What that essentially means is >> + * that by the time the thread actually handled the event and read the >> + * data from the monitor, the event loop fired the very same event, simply >> + * because the data hadn't been retrieved from the socket at that point yet. > > ^This doesn't need to be a separate paragraph, you can connect it to the > previous one. Also, some words don't feel right for commentary purposes (I > know > they're mine :D). How about: > > "In fact, the event loop could be scheduled earlier than the handler thread, > thus potentially emitting the very same event the handler thread is currently > trying to process, simply because the data hadn't been retrieved from the > socket at that time yet." > >> + * >> + * Thus this algorithm is that once udevEventHandleCallback is notified >> + * there is data, > > ^this can be dropped... > >> + * this loop will process as much data as possible until >> + * udev_monitor_receive_device fails to get the @device. This may be >> + * because we've processed everything or because the scheduling thread >> + * hasn't been able to populate data in the socket. Once we're done >> + * processing everything we can, wait on the next event/notification. > > How about: > > "Instead of waiting for the next event after retrieving a device data we keep > reading from the udev monitor until we encounter one of EAGAIN or EWOULDBLOCK > errors, only then we'll wait for the next event." > > Also, I'd move that bit into the function body, so it's directly tied to > the piece of code responsible for it. > >> + * Although this may incur a slight delay if the socket data wasn't >> + * populated, the event/notifications are fairly consistent so there's >> + * no need to use virCondWaitUntil. > > ^This could be dropped. > >> + * >> + * NB: Usage of event based socket algorithm causes some issues with >> + * older platforms such as CentOS 6 in which the data socket is opened >> + * without the NONBLOCK bit set. Still even if the reset of priv->dataReady >> + * were moved to ju
[libvirt] [PATCH v2] nodedev: Document the udevEventHandleThread
Commit cdbe1332 neglected to document the API. So let's add some details about the algorithm and why it was used to help future readers understand the issues encountered. Based largely off a description provided by Erik Skultety . NB: Management of the processing udev device notification is a delicate balance between the udev process, the scheduler, and when exactly the data from/for the socket is received. The balance is particularly important for environments when multiple devices are added into the system more or less simultaneously such as is done for mdev. In these cases scheduler blocking on the udev recv() call is more noticeable. It's expected that future devices will follow similar algorithms. Even though the algorithm does present some challenges for older OS's (such as Centos 6), trying to rewrite the algorithm to fit both models would be more complex and involve pulling the monitor object out of the private data lockable object and would need to be guarded by a separate lock. Devising such an algorithm to work around issues with older OS's at the expense of more modern OS algorithms in newer event processing code may result in unexpected issues, so the choice is to encourage use of newer OS's with newer udev event processing code. Signed-off-by: John Ferlan --- v1: https://www.redhat.com/archives/libvir-list/2018-October/msg01053.html Changes are from code review with some minor tweaks/editing as I went along. Mistakes are my own! src/node_device/node_device_udev.c | 28 1 file changed, 28 insertions(+) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 22897591de..f2c2299d4d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1591,6 +1591,26 @@ udevEventMonitorSanityCheck(udevEventDataPtr priv, } +/** + * udevEventHandleThread + * @opaque: unused + * + * Thread to handle the udevEventHandleCallback processing when udev + * tells us there's a device change for us (add, modify, delete, etc). + * + * Once notified there is data to be processed, the actual @device + * data retrieval by libudev may be delayed due to how threads are + * scheduled. In fact, the event loop could be scheduled earlier than + * the handler thread, thus potentially emitting the very same event + * the handler thread is currently trying to process, simply because + * the data hadn't been retrieved from the socket. + * + * NB: Usage of event based socket algorithm causes some issues with + * older platforms such as CentOS 6 in which the data socket is opened + * without the NONBLOCK bit set. That can be avoided by moving the reset + * priv->dataReady to just after the udev_monitor_receive_device; however, + * scheduler issues would still come into play. + */ static void udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) { @@ -1637,6 +1657,9 @@ udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) return; } +/* Trying to move the reset of the @priv->dataReady flag to + * after the udev_monitor_receive_device wouldn't help much + * due to event mgmt and scheduler timing. */ virObjectLock(priv); priv->dataReady = false; virObjectUnlock(priv); @@ -1646,6 +1669,11 @@ udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) udevHandleOneDevice(device); udev_device_unref(device); + +/* Instead of waiting for the next event after processing @device + * data, let's keep reading from the udev monitor and only wait + * for the next event once either a EAGAIN or a EWOULDBLOCK error + * is encountered. */ } } -- 2.17.2 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v2 3/3] qemu: Narrow the shutdown reconnection failure reason window
The current qemuProcessReconnect logic paints a broad brush determining that the shutdown reason must be crashed if it was determined that the domain was started with -no-shutdown; however, there's many other ways to get to the error label, so let's narrow our reasoning window for using VIR_DOMAIN_SHUTOFF_CRASHED to the period where we essentially know we've tried to create to the monitor and before we were successful in opening the connection. Failures that occur outside that window would thus be considered as VIR_DOMAIN_SHUTOFF_UNKNOWN, at least for now. Signed-off-by: John Ferlan --- src/qemu/qemu_process.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7112f3c048..85f30bf9d4 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7757,6 +7757,7 @@ qemuProcessReconnect(void *opaque) bool jobStarted = false; virCapsPtr caps = NULL; bool retry = true; +bool tryMonReconn = false; VIR_FREE(data); @@ -7797,6 +7798,8 @@ qemuProcessReconnect(void *opaque) VIR_DEBUG("Reconnect monitor to def=%p name='%s' retry=%d", obj, obj->def->name, retry); +tryMonReconn = true; + /* XXX check PID liveliness & EXE path */ if (qemuConnectMonitor(driver, obj, QEMU_ASYNC_JOB_NONE, retry, NULL) < 0) goto error; @@ -7993,7 +7996,8 @@ qemuProcessReconnect(void *opaque) * If we cannot get to the monitor when the QEMU command * line used -no-shutdown, then we can safely say that the * domain crashed; otherwise we don't really know. */ -if (qemuDomainIsUsingNoShutdown(priv)) +if (!priv->mon && tryMonReconn && +qemuDomainIsUsingNoShutdown(priv)) state = VIR_DOMAIN_SHUTOFF_CRASHED; else state = VIR_DOMAIN_SHUTOFF_UNKNOWN; -- 2.17.2 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v2 2/3] qemu: Restore lost shutdown reason
When qemuProcessReconnectHelper was introduced (commit d38897a5d) reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was removed in commit fe35b1ad6 the conditional state was just left at VIR_DOMAIN_SHUTOFF_CRASHED. So introduce qemuDomainIsUsingNoShutdown which will manage the condition when the domain was started with -no-shutdown so that when/if reconnection failure occurs we can restore the decision point used to determine whether CRASHED or UNKNOWN is provided. Signed-off-by: John Ferlan --- src/qemu/qemu_command.c | 6 +- src/qemu/qemu_domain.c | 17 + src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 15 ++- 4 files changed, 31 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 6e3ff67660..5b94931af6 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6415,11 +6415,7 @@ qemuBuildPMCommandLine(virCommandPtr cmd, if (priv->allowReboot == VIR_TRISTATE_BOOL_NO) virCommandAddArg(cmd, "-no-reboot"); -/* If JSON monitor is enabled, we can receive an event - * when QEMU stops. If we use no-shutdown, then we can - * watch for this event and do a soft/warm reboot. - */ -if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES) +if (qemuDomainIsUsingNoShutdown(priv)) virCommandAddArg(cmd, "-no-shutdown"); if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_NO_ACPI)) { diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index ba3fff607a..045a7b4ac0 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -13561,3 +13561,20 @@ qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason) return VIR_DOMAIN_EVENT_RESUMED_UNPAUSED; } + + +/* qemuDomainIsUsingNoShutdown: + * @priv: Domain private data + * + * If JSON monitor is enabled, we can receive an event when QEMU stops. If + * we use no-shutdown, then we can watch for this event and do a soft/warm + * reboot. + * + * Returns: @true when -no-shutdown either should be or was added to the + * command line. + */ +bool +qemuDomainIsUsingNoShutdown(qemuDomainObjPrivatePtr priv) +{ +return priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES; +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 80bd4bde91..554435e0a9 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -1089,4 +1089,7 @@ void qemuDomainStorageIdReset(qemuDomainObjPrivatePtr priv); virDomainEventResumedDetailType qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason); +bool +qemuDomainIsUsingNoShutdown(qemuDomainObjPrivatePtr priv); + #endif /* __QEMU_DOMAIN_H__ */ diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 5232f761af..7112f3c048 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7988,11 +7988,16 @@ qemuProcessReconnect(void *opaque) if (virDomainObjIsActive(obj)) { /* We can't get the monitor back, so must kill the VM * to remove danger of it ending up running twice if - * user tries to start it again later - * If we couldn't get the monitor since QEMU supports - * no-shutdown, we can safely say that the domain - * crashed ... */ -state = VIR_DOMAIN_SHUTOFF_CRASHED; + * user tries to start it again later. + * + * If we cannot get to the monitor when the QEMU command + * line used -no-shutdown, then we can safely say that the + * domain crashed; otherwise we don't really know. */ +if (qemuDomainIsUsingNoShutdown(priv)) +state = VIR_DOMAIN_SHUTOFF_CRASHED; +else +state = VIR_DOMAIN_SHUTOFF_UNKNOWN; + /* If BeginJob failed, we jumped here without a job, let's hope another * thread didn't have a chance to start playing with the domain yet * (it's all we can do anyway). -- 2.17.2 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v2 0/3] qemu: Restore lost shutdown reason
v1: https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html Changes since v1: Patch 1 (NEW) - Set the priv->allowReboot prior to any error processing since we could/should be using that. Patch 2 (ADJUSTED LOGIC) - Rather than open code the "reason" to use the -no-shutdown flag, let's create a qemu_domain helper for that so that both the command line building and the Reconnection failure logic can use it. Patch 3 (NEW) - Let's narrow the window for using VIR_DOMAIN_SHUTOFF_CRASHED to just the period where we try to open a monitor channel. If we cannot do so, then "assume" the reason was crashed. There are a few open failure steps that may not exactly fit the model, but those are probably splitting hairs. John Ferlan (3): qemu: Move allow reboot check setting qemu: Restore lost shutdown reason qemu: Narrow the shutdown reconnection failure reason window src/qemu/qemu_command.c | 6 +- src/qemu/qemu_domain.c | 17 + src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 27 ++- 4 files changed, 39 insertions(+), 14 deletions(-) -- 2.17.2 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v2 1/3] qemu: Move allow reboot check setting
Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly. Signed-off-by: John Ferlan --- src/qemu/qemu_process.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData; +/* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ +qemuProcessPrepareAllowReboot(obj); + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) goto error; @@ -7783,10 +7787,6 @@ qemuProcessReconnect(void *opaque) if (qemuDomainMasterKeyReadFile(priv) < 0) goto error; -/* If we are connecting to a guest started by old libvirt there is no - * allowReboot in status XML and we need to initialize it. */ -qemuProcessPrepareAllowReboot(obj); - if (qemuHostdevUpdateActiveDomainDevices(driver, obj->def) < 0) goto error; -- 2.17.2 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On 10/30/2018 01:55 PM, Daniel P. Berrangé wrote: > On Tue, Oct 30, 2018 at 10:32:08AM +, Daniel P. Berrangé wrote: >> On Tue, Oct 30, 2018 at 11:08:45AM +0100, Michal Privoznik wrote: >>> On 10/30/2018 10:35 AM, Daniel P. Berrangé wrote: On Tue, Oct 30, 2018 at 09:13:50AM +0100, Michal Privoznik wrote: > On 10/29/2018 06:34 PM, Marc Hartmayer wrote: >> Introduce caching whether /dev/kvm is usable as the QEMU user:QEMU >> group. This reduces the overhead of the QEMU capabilities cache >> lookup. Before this patch there were many fork() calls used for >> checking whether /dev/kvm is accessible. Now we store the result >> whether /dev/kvm is accessible or not and we only need to re-run the >> virFileAccessibleAs check if the ctime of /dev/kvm has changed. >> >> Suggested-by: Daniel P. Berrangé >> Signed-off-by: Marc Hartmayer >> --- >> src/qemu/qemu_capabilities.c | 54 ++-- >> 1 file changed, 52 insertions(+), 2 deletions(-) >> >> diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c >> index e228f52ec0bb..85516954149b 100644 >> --- a/src/qemu/qemu_capabilities.c >> +++ b/src/qemu/qemu_capabilities.c >> @@ -3238,6 +3238,10 @@ struct _virQEMUCapsCachePriv { >> virArch hostArch; >> unsigned int microcodeVersion; >> char *kernelVersion; >> + >> +/* cache whether /dev/kvm is usable as runUid:runGuid */ >> +virTristateBool kvmUsable; >> +time_t kvmCtime; >> }; >> typedef struct _virQEMUCapsCachePriv virQEMUCapsCachePriv; >> typedef virQEMUCapsCachePriv *virQEMUCapsCachePrivPtr; >> @@ -3824,6 +3828,52 @@ virQEMUCapsSaveFile(void *data, >> } >> >> >> +/* Determine whether '/dev/kvm' is usable as QEMU user:QEMU group. */ >> +static bool >> +virQEMUCapsKVMUsable(virQEMUCapsCachePrivPtr priv) >> +{ >> +struct stat sb; >> +static const char *kvm_device = "/dev/kvm"; >> +virTristateBool value; >> +virTristateBool cached_value = priv->kvmUsable; >> +time_t kvm_ctime; >> +time_t cached_kvm_ctime = priv->kvmCtime; >> + >> +if (stat(kvm_device, &sb) < 0) { >> +virReportSystemError(errno, >> + _("Failed to stat %s"), kvm_device); >> +return false; >> +} >> +kvm_ctime = sb.st_ctime; > > This doesn't feel right. /dev/kvm ctime is changed every time qemu is > started or powered off (try running stat over it before and after a > domain is started/shut off). So effectively we will fork more often than > we would think. Should we cache inode number instead? Because for all > that we care is simply if the file is there. Urgh, that is a bit strange and not the usual semantics for timestamps :-( >>> >>> Indeed. >>> We can't stat the inode - the code was explicitly trying to cope with the way /dev/kvm can change permissions when udev rules get applied. We would have to compare the user, group, permissions mask and even ACL, or a hash of those. >>> >>> Well, we can use ctime as suggested and post a patch for kernel to fix >>> ctime behaviour. Until the patch is merged our behaviour would be >>> suboptimal, but still better than it is now. >> >> I guess lets talk to KVM team for their input on this and then decide >> what todo. > > Hmm, I wonder if it is not actually a kernel problem, but rather something > in userspace genuinely touching the device in a way that caues these > timestamps to be updated. > > eg I vaguely recall a udev rule that resets permissions on device nodes > whenever an FD is closed, which might cause this kind of behaviour After trying hard to find the rule that mangles the ctime I've found the following bug in udev: https://github.com/zippy2/systemd/commit/51915e4e6a64c581da321c25448d80626d8e8408?diff=unified I've send that as a pull request to systemd. So after all, ctime might be usable again. Michal -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH] nodedev: Add new module node_device_util
There's a lot of stuff going on in src/conf/nodedev_conf which not always has to do anything with config, so even though we're trying, we're not really consistent in putting only parser/formatter related stuff here like we do for domains. So, start the cleanup simply by adding a new module to the nodedev driver and put a few helper APIs which want to open a secondary driver connection in there (similar to storage_util module). Signed-off-by: Erik Skultety --- I verified the build with debian 9, centos 7, fedora 28, rawhide, and freebsd 11 src/conf/Makefile.inc.am | 1 + src/conf/node_device_conf.c | 199 --- src/conf/node_device_conf.h | 11 -- src/conf/virstorageobj.c | 1 + src/libvirt_private.syms | 8 +- src/node_device/Makefile.inc.am | 17 +- src/node_device/node_device_driver.c | 1 + src/node_device/node_device_util.c | 229 +++ src/node_device/node_device_util.h | 35 src/storage/Makefile.inc.am | 1 + src/storage/storage_backend_scsi.c | 1 + 11 files changed, 290 insertions(+), 214 deletions(-) create mode 100644 src/node_device/node_device_util.c create mode 100644 src/node_device/node_device_util.h diff --git a/src/conf/Makefile.inc.am b/src/conf/Makefile.inc.am index af23810640..7cb6c29d70 100644 --- a/src/conf/Makefile.inc.am +++ b/src/conf/Makefile.inc.am @@ -163,6 +163,7 @@ libvirt_la_BUILT_LIBADD += libvirt_conf.la libvirt_conf_la_SOURCES = $(CONF_SOURCES) libvirt_conf_la_CFLAGS = \ -I$(srcdir)/conf \ + -I$(srcdir)/node_device \ $(AM_CFLAGS) \ $(NULL) libvirt_conf_la_LDFLAGS = $(AM_LDFLAGS) diff --git a/src/conf/node_device_conf.c b/src/conf/node_device_conf.c index 03bd794dc0..74a7bc3933 100644 --- a/src/conf/node_device_conf.c +++ b/src/conf/node_device_conf.c @@ -2220,205 +2220,6 @@ virNodeDevCapsDefFree(virNodeDevCapsDefPtr caps) } -/* virNodeDeviceGetParentName - * @conn: Connection pointer - * @nodedev_name: Node device to lookup - * - * Lookup the node device by name and return the parent name - * - * Returns parent name on success, caller is responsible for freeing; - * otherwise, returns NULL on failure - */ -char * -virNodeDeviceGetParentName(virConnectPtr conn, - const char *nodedev_name) -{ -virNodeDevicePtr device = NULL; -char *parent; - -if (!(device = virNodeDeviceLookupByName(conn, nodedev_name))) { -virReportError(VIR_ERR_XML_ERROR, - _("Cannot find '%s' in node device database"), - nodedev_name); -return NULL; -} - -ignore_value(VIR_STRDUP(parent, virNodeDeviceGetParent(device))); -virObjectUnref(device); - -return parent; -} - - -/** - * @fchost: Pointer to vHBA adapter - * - * Create a vHBA for Storage. This code accomplishes this via searching - * through the sysfs for scsi_host/fc_host in order to first ensure some - * vHBA doesn't already exist for the requested wwnn/wwpn (e.g. an unmanaged - * vHBA) and to search for the parent vport capable scsi_host by name, - * wwnn/wwpn, or fabric_wwn (if provided). If no parent is provided, then - * a vport capable scsi_host will be selected. - * - * Returns vHBA name on success, NULL on failure with an error message set - */ -char * -virNodeDeviceCreateVport(virStorageAdapterFCHostPtr fchost) -{ -unsigned int parent_host; -char *name = NULL; -char *parent_hoststr = NULL; -bool skip_capable_check = false; - -VIR_DEBUG("parent='%s', wwnn='%s' wwpn='%s'", - NULLSTR(fchost->parent), fchost->wwnn, fchost->wwpn); - -if (fchost->parent) { -if (VIR_STRDUP(parent_hoststr, fchost->parent) < 0) -goto cleanup; -} else if (fchost->parent_wwnn && fchost->parent_wwpn) { -if (!(parent_hoststr = virVHBAGetHostByWWN(NULL, fchost->parent_wwnn, - fchost->parent_wwpn))) { -virReportError(VIR_ERR_XML_ERROR, "%s", - _("cannot find parent using provided wwnn/wwpn")); -goto cleanup; -} -} else if (fchost->parent_fabric_wwn) { -if (!(parent_hoststr = - virVHBAGetHostByFabricWWN(NULL, fchost->parent_fabric_wwn))) { -virReportError(VIR_ERR_XML_ERROR, "%s", - _("cannot find parent using provided fabric_wwn")); -goto cleanup; -} -} else { -if (!(parent_hoststr = virVHBAFindVportHost(NULL))) { -virReportError(VIR_ERR_XML_ERROR, "%s", - _("'parent' for vHBA not specified, and " - "cannot find one on this host")); -goto cleanup; -} -skip_capable_check = true; -} - -if (virSCSIHostGetNumber(parent_hoststr, &parent_host) < 0) -goto cleanup; - -/* NOTE: - * We do not save the parent_host
Re: [libvirt] [PATCH] qemu: Restore lost shutdown reason
On 10/22/18 3:36 AM, Nikolay Shirokovskiy wrote: > > > On 18.10.2018 18:28, John Ferlan wrote: >> When qemuProcessReconnectHelper was introduced (commit d38897a5d) >> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that >> was changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED >> or VIR_DOMAIN_SHUTOFF_UNKNOWN. >> >> When QEMU_CAPS_NO_SHUTDOWN checking was removed in commit fe35b1ad6 >> the conditional state was just left at VIR_DOMAIN_SHUTOFF_CRASHED. >> >> Restore the logic adjustment using the same conditions as command >> line building and alter the comment to describe the reasoning. >> >> Signed-off-by: John Ferlan >> --- >> >> This was "discovered" while reviewing Nikolay's patch related >> to adding "DAEMON" as the shutdown reason: >> >> https://www.redhat.com/archives/libvir-list/2018-October/msg00493.html >> >> While working through the iterations of that - figured I'd at >> least post this. >> >> >> src/qemu/qemu_process.c | 15 ++- >> 1 file changed, 10 insertions(+), 5 deletions(-) >> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c >> index 3955eda17c..4a39111d51 100644 >> --- a/src/qemu/qemu_process.c >> +++ b/src/qemu/qemu_process.c >> @@ -7985,11 +7985,16 @@ qemuProcessReconnect(void *opaque) >> if (virDomainObjIsActive(obj)) { >> /* We can't get the monitor back, so must kill the VM >> * to remove danger of it ending up running twice if >> - * user tries to start it again later >> - * If we couldn't get the monitor since QEMU supports >> - * no-shutdown, we can safely say that the domain >> - * crashed ... */ >> -state = VIR_DOMAIN_SHUTOFF_CRASHED; >> + * user tries to start it again later. >> + * >> + * If we cannot get to the monitor when the QEMU command >> + * line used -no-shutdown, then we can safely say that the >> + * domain crashed; otherwise we don't really know. */ >> +if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES) >> +state = VIR_DOMAIN_SHUTOFF_CRASHED; >> +else >> +state = VIR_DOMAIN_SHUTOFF_UNKNOWN; >> + >> /* If BeginJob failed, we jumped here without a job, let's hope >> another >> * thread didn't have a chance to start playing with the domain yet >> * (it's all we can do anyway). >> > > This is reasonable but not complete. This is only applied to case when we do > connect(2) I understand your point; however, the goal of this patch wasn't to fix the various other failure scenarios/reasons. It was to merely restore the lost UNKNOWN reason in the event that either priv->monJSON == false (unlikely) or priv->allowReboot != VIR_TRISTATE_BOOL_YES (possible). Rather than copy-pasta, I'll create qemuDomainIsUsingNoShutdown(priv) which will perform the condition checks for both command line generation and reconnection failure paths. If that's not desirable, then that's fine. I won't spend more cycles on this. John > to qemu and it is failed with ECONNREFUSED which means qemu process does not > exists > (in which case it is either crashed if was configured with -no-shutdown or > terminated > for unknown reason) or just closed monitor connection (in this case we are > going > to terminate it by ourselves). > > But there are other cases. For example we can jump to error path even before > connect(2) > or after successfull connect. See discussion of such cases in [1]. > > [1] https://www.redhat.com/archives/libvir-list/2018-October/msg01031.html > > Nikolay > -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v1 2/7] pcihp: overwrite hotplug handler recursively from the start
On Wed, 24 Oct 2018 12:19:25 +0200 David Hildenbrand wrote: > For now, the hotplug handler is not called for devices that are > being cold plugged. The hotplug handler is setup when the machine > initialization is fully done. Only bridges that were cold plugged are > considered. > > Set the hotplug handler for the root piix bus directly when realizing. > Overwrite the hotplug handler of bridges when hotplugging/coldplugging > them. > > This will now make sure that the ACPI PCI hotplug handler is also called > for cold-plugged devices (also on bridges) and for bridges that were > hotplugged. > > When trying to hotplug a device to a hotplugged bridge, we now correctly > get the error message > "Unsupported bus. Bus doesn't have property 'acpi-pcihp-bsel' set" > Insted of going via the standard PCI hotplug handler. Erroring out is probably not ok, since it can break existing setups where SHPC hotplugging to hotplugged bridge was working just fine before. Marcel/Michael what's your take on this change in behaviour? CCing libvirt in case they are doing this stuff > > Signed-off-by: David Hildenbrand > --- > hw/acpi/pcihp.c | 10 ++ > hw/acpi/piix4.c | 16 +--- > 2 files changed, 11 insertions(+), 15 deletions(-) > > diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c > index 5e7cef173c..647b5e8d82 100644 > --- a/hw/acpi/pcihp.c > +++ b/hw/acpi/pcihp.c > @@ -30,6 +30,7 @@ > #include "hw/hw.h" > #include "hw/i386/pc.h" > #include "hw/pci/pci.h" > +#include "hw/pci/pci_bridge.h" > #include "hw/acpi/acpi.h" > #include "sysemu/sysemu.h" > #include "exec/address-spaces.h" > @@ -236,6 +237,15 @@ void acpi_pcihp_device_plug_cb(HotplugHandler > *hotplug_dev, AcpiPciHpState *s, > int slot = PCI_SLOT(pdev->devfn); > int bsel; > > +if (!s->legacy_piix && object_dynamic_cast(OBJECT(dev), > TYPE_PCI_BRIDGE)) { > +PCIBus *sec = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev)); > + > +/* Overwrite the default hotplug handler with the ACPI PCI one. */ > +qbus_set_hotplug_handler(BUS(sec), DEVICE(hotplug_dev), > &error_abort); > +/* No need to do it recursively */ > +assert(QLIST_EMPTY(&sec->child)); > +} > + > /* Don't send event when device is enabled during qemu machine creation: > * it is present on boot, no hotplug event is necessary. We do send an > * event when the device is disabled later. */ > diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c > index 167afe2cea..e611778daf 100644 > --- a/hw/acpi/piix4.c > +++ b/hw/acpi/piix4.c > @@ -446,15 +446,6 @@ static void piix4_device_unplug_cb(HotplugHandler > *hotplug_dev, > } > } > > -static void piix4_update_bus_hotplug(PCIBus *pci_bus, void *opaque) > -{ > -PIIX4PMState *s = opaque; > - > -/* pci_bus cannot outlive PIIX4PMState, because /machine keeps it alive > - * and it's not hot-unpluggable */ > -qbus_set_hotplug_handler(BUS(pci_bus), DEVICE(s), &error_abort); > -} > - > static void piix4_pm_machine_ready(Notifier *n, void *opaque) > { > PIIX4PMState *s = container_of(n, PIIX4PMState, machine_ready); > @@ -468,12 +459,6 @@ static void piix4_pm_machine_ready(Notifier *n, void > *opaque) > pci_conf[0x63] = 0x60; > pci_conf[0x67] = (memory_region_present(io_as, 0x3f8) ? 0x08 : 0) | > (memory_region_present(io_as, 0x2f8) ? 0x90 : 0); > - > -if (s->use_acpi_pci_hotplug) { > -pci_for_each_bus(pci_get_bus(d), piix4_update_bus_hotplug, s); > -} else { > -piix4_update_bus_hotplug(pci_get_bus(d), s); > -} > } > > static void piix4_pm_add_propeties(PIIX4PMState *s) > @@ -547,6 +532,7 @@ static void piix4_pm_realize(PCIDevice *dev, Error **errp) > > piix4_acpi_system_hot_add_init(pci_address_space_io(dev), > pci_get_bus(dev), s); > +qbus_set_hotplug_handler(BUS(pci_get_bus(dev)), DEVICE(s), &error_abort); > > piix4_pm_add_propeties(s); > } -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] No change in git since RC1 so no RC2 for libvirt 4.9.0
Since nothing was commited in git since RC1 it doesn't really make sense for me to push an RC2 at this point :-) So please continue testing RC1, I will check status over the w.e. as I plan to be offline for a bit more than a day, and depending on status either cut an RC2 or push directly 4.9.0 GA, thanks in advance, Daniel -- Daniel Veillard | Red Hat Developers Tools http://developer.redhat.com/ veill...@redhat.com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | virtualization library http://libvirt.org/ -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Mon, Oct 29, 2018 at 06:34:58PM +0100, Marc Hartmayer wrote: > Introduce caching whether /dev/kvm is usable as the QEMU user:QEMU > group. This reduces the overhead of the QEMU capabilities cache > lookup. Before this patch there were many fork() calls used for > checking whether /dev/kvm is accessible. Now we store the result > whether /dev/kvm is accessible or not and we only need to re-run the > virFileAccessibleAs check if the ctime of /dev/kvm has changed. > > Suggested-by: Daniel P. Berrangé > Signed-off-by: Marc Hartmayer > --- > src/qemu/qemu_capabilities.c | 54 ++-- > 1 file changed, 52 insertions(+), 2 deletions(-) For benefit of those who were not at KVM Forum last week, this patch is addressing the performance problem described starting here: https://youtu.be/H3lw50IKqGo?t=1055 Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] qemu: Report warning if QoS is set for vhostuser interface
On Thu, Nov 01, 2018 at 02:07:21PM +0100, Michal Privoznik wrote: > On 11/01/2018 10:33 AM, Erik Skultety wrote: > > On Wed, Oct 31, 2018 at 04:47:45PM +0100, Michal Privoznik wrote: > >> https://bugzilla.redhat.com/show_bug.cgi?id=1524230 > >> > >> Because of historical reasons, we are not denying starting a > >> domain which has QoS set for unsupported type of device. We do > >> report just a warning instead. And even though we perhaps used to > >> do so for vhostuser it got lost somewhere. Bring it back. > > > > So, if my blame-fu isn't flawed, then I don't think that vhostuser ever > > reported a warning during machine startup, have a look at commits > > 4a74ccdb92f > > and 0bce012d7f0 respectively - there always was a goto cleanup statement. > > > > Ah, so we did not report warning for vhostuser, but for everything else > we do. > > >> > >> Signed-off-by: Michal Privoznik > >> --- > >> src/qemu/qemu_command.c | 6 ++ > >> 1 file changed, 6 insertions(+) > >> > >> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c > >> index 6e3ff67660..489e8bc689 100644 > >> --- a/src/qemu/qemu_command.c > >> +++ b/src/qemu/qemu_command.c > >> @@ -8246,6 +8246,12 @@ qemuBuildVhostuserCommandLine(virQEMUDriverPtr > >> driver, > >> goto cleanup; > >> } > >> > >> +if (virDomainNetGetActualBandwidth(net)) { > >> +VIR_WARN("setting bandwidth on interfaces of " > >> + "type '%s' is not implemented yet", > >> + virDomainNetTypeToString(VIR_DOMAIN_NET_TYPE_VHOSTUSER)); > >> +} > >> + > > > > We already do this kind of checking on line 8593 in the same source file. > > It's > > just because of the goto cleanup jump after calling > > qemuBuildVhostuserCommandLine that causes the logic to never reach the > > condition > > (Not to mention the code is a mess). I'd suggest moving the already existing > > check to the top of the qemuBuildInterfaceCommandLine function instead of > > duplicating the code. > > I'm not sure I understand. The code looks something like this: > > switch (actualType) { > case VIR_DOMAIN_NET_TYPE_NETWORK: > case VIR_DOMAIN_NET_TYPE_BRIDGE: > createTapDevice(); > break; > > > case VIR_DOMAIN_NET_TYPE_VHOSTUSER: > ret = qemuBuildVhostuserCommandLine(); > goto cleanup; > } > > if (bandwidth) { > if (bandwidthSupported(actualType)) > setBandwidth(); > else > VIR_WARN(); > } > > Obviously, I can't just move 'if (bandwidth)' chunk before switch() > because setBandwidth() wouldn't have an interface to operate on (as it > is created in the switch). Or did you have something else on mind? Nah, my bad, I missed that there actually was the setBandwidth call. That basically leaves us with 2 options (if don't count your original patch) and both of them include some refactor: 1) extract the common bits of NET_TYPE_(DIRECT|ETHERNET|BRIDGE) to a helper, then extract the bandwidth bits to a helper and call this from the per-interface-type helpers 2) do a massive refactor of the whole interface-cmdline generating function, so that there wouldn't be short exit paths just like it is the case now for vhostuser so that in the end the bandwidth snippet we have can stay and will be applied to all the interface types without exception. Erik -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH libvirt-python] event-test.py: Report ERROR events
On 11/01/2018 11:20 AM, Philipp Hahn wrote: > VIR_DOMAIN_EVENT_ID_IO_ERROR and VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON > callbacks receive the same 'action' parameter, so also translate that > numeric action to a descriptive text for the first callback. > > Signed-off-by: Philipp Hahn > --- > examples/event-test.py | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) ACKed and pushed. Michal -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] qemu: Report warning if QoS is set for vhostuser interface
On 11/01/2018 10:33 AM, Erik Skultety wrote: > On Wed, Oct 31, 2018 at 04:47:45PM +0100, Michal Privoznik wrote: >> https://bugzilla.redhat.com/show_bug.cgi?id=1524230 >> >> Because of historical reasons, we are not denying starting a >> domain which has QoS set for unsupported type of device. We do >> report just a warning instead. And even though we perhaps used to >> do so for vhostuser it got lost somewhere. Bring it back. > > So, if my blame-fu isn't flawed, then I don't think that vhostuser ever > reported a warning during machine startup, have a look at commits 4a74ccdb92f > and 0bce012d7f0 respectively - there always was a goto cleanup statement. > Ah, so we did not report warning for vhostuser, but for everything else we do. >> >> Signed-off-by: Michal Privoznik >> --- >> src/qemu/qemu_command.c | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c >> index 6e3ff67660..489e8bc689 100644 >> --- a/src/qemu/qemu_command.c >> +++ b/src/qemu/qemu_command.c >> @@ -8246,6 +8246,12 @@ qemuBuildVhostuserCommandLine(virQEMUDriverPtr driver, >> goto cleanup; >> } >> >> +if (virDomainNetGetActualBandwidth(net)) { >> +VIR_WARN("setting bandwidth on interfaces of " >> + "type '%s' is not implemented yet", >> + virDomainNetTypeToString(VIR_DOMAIN_NET_TYPE_VHOSTUSER)); >> +} >> + > > We already do this kind of checking on line 8593 in the same source file. It's > just because of the goto cleanup jump after calling > qemuBuildVhostuserCommandLine that causes the logic to never reach the > condition > (Not to mention the code is a mess). I'd suggest moving the already existing > check to the top of the qemuBuildInterfaceCommandLine function instead of > duplicating the code. I'm not sure I understand. The code looks something like this: switch (actualType) { case VIR_DOMAIN_NET_TYPE_NETWORK: case VIR_DOMAIN_NET_TYPE_BRIDGE: createTapDevice(); break; case VIR_DOMAIN_NET_TYPE_VHOSTUSER: ret = qemuBuildVhostuserCommandLine(); goto cleanup; } if (bandwidth) { if (bandwidthSupported(actualType)) setBandwidth(); else VIR_WARN(); } Obviously, I can't just move 'if (bandwidth)' chunk before switch() because setBandwidth() wouldn't have an interface to operate on (as it is created in the switch). Or did you have something else on mind? Michal -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH libvirt-python] event-test.py: Fix ERROR event
On 11/01/2018 10:55 AM, Philipp Hahn wrote: > ERROR_EVENTS translates the numeric 'action' argument to a description, > not the 'reason' argument which already contains a descriptive string > like 'enospc'. > >> Traceback (most recent call last): >> File "/usr/lib/python2.7/dist-packages/libvirt.py", line 4661, in >> _dispatchDomainEventIOErrorReasonCallback >> reason, opaque) >> File "libvirt-python/examples/event-test.py", line 536, in >> myDomainEventIOErrorReasonCallback >> dom.name(), dom.ID(), srcpath, devalias, action, ERROR_EVENTS[reason])) >> File "libvirt-python/examples/event-test.py", line 474, in __getitem__ >> data = self.args[item] >> TypeError: tuple indices must be integers, not str > > Fixes: f5928c6711654f1496707ca77f626b3192843d57 > Signed-off-by: Philipp Hahn > --- > examples/event-test.py | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > ACKed and pushed. Michal -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH 7/7] tests: fix dry run handling in network firewall test
The networkxml2firewalltest sets virCommand to dry run mode but doesn't provide a callback to fill in stdout/stderr. As a result when the firewall code queries rules it gets a NULL output and so never triggers the callback to process output. We only need to return an empty string to make the firewall code work and thus trigger adding of the libvirt private chains to the builtin chains. Signed-off-by: Daniel P. Berrangé --- .../nat-default-linux.args| 48 +++ .../nat-ipv6-linux.args | 48 +++ .../nat-many-ips-linux.args | 48 +++ .../nat-no-dhcp-linux.args| 48 +++ .../nat-tftp-linux.args | 48 +++ .../route-default-linux.args | 48 +++ tests/networkxml2firewalltest.c | 16 ++- 7 files changed, 303 insertions(+), 1 deletion(-) diff --git a/tests/networkxml2firewalldata/nat-default-linux.args b/tests/networkxml2firewalldata/nat-default-linux.args index 69995181ad..e7d71817c7 100644 --- a/tests/networkxml2firewalldata/nat-default-linux.args +++ b/tests/networkxml2firewalldata/nat-default-linux.args @@ -72,6 +72,54 @@ ip6tables \ --list POSTROUTING iptables \ --table filter \ +--insert INPUT \ +--jump INP_libvirt +iptables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +iptables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +ip6tables \ +--table filter \ +--insert INPUT \ +--jump INP_libvirt +ip6tables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +ip6tables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +iptables \ +--table filter \ --insert INP_libvirt \ --in-interface virbr0 \ --protocol tcp \ diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.args b/tests/networkxml2firewalldata/nat-ipv6-linux.args index f93d8face2..620ebb8d14 100644 --- a/tests/networkxml2firewalldata/nat-ipv6-linux.args +++ b/tests/networkxml2firewalldata/nat-ipv6-linux.args @@ -72,6 +72,54 @@ ip6tables \ --list POSTROUTING iptables \ --table filter \ +--insert INPUT \ +--jump INP_libvirt +iptables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +iptables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +ip6tables \ +--table filter \ +--insert INPUT \ +--jump INP_libvirt +ip6tables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +ip6tables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +iptables \ +--table filter \ --insert INP_libvirt \ --in-interface virbr0 \ --protocol tcp \ diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.args b/tests/networkxml2firewalldata/nat-many-ips-linux.args index faae4b881c..7c378b8c7e 100644 --- a/tests/networkxml2firewalldata/nat-many-ips-linux.args +++ b/tests/networkxml2firewalldata/nat-many-ips-linux.args @@ -72,6 +72,54 @@ ip6tables \ --list POSTROUTING iptables \ --table filter \ +--insert INPUT \ +--jump INP_libvirt +iptables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +iptables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +iptables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +ip6tables \ +--table filter \ +--insert INPUT \ +--jump INP_libvirt +ip6tables \ +--table filter \ +--insert OUTPUT \ +--jump OUT_libvirt +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_out +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_in +ip6tables \ +--table filter \ +--insert FORWARD \ +--jump FWD_libvirt_cross +ip6tables \ +--table nat \ +--insert POSTROUTING \ +--jump PRT_libvirt +iptables \ +--table filter \ --insert INP_libvirt \ --in-interface virbr0 \ --protocol tcp \ diff --git a/tests/networkxml2firewalldata/nat-no-dhcp-linux.args b/tests/networkxml2firewalldata/nat-no-dhcp-linux.args index cb0d908506..afa8c3a0ca 100644
[libvirt] [PATCH 4/7] network: setup default iptables chains
Register the default chains that will be used to hold firewall rules at network startup. Signed-off-by: Daniel P. Berrangé --- src/network/bridge_driver_linux.c | 3 + .../nat-default-linux.args| 72 +++ .../nat-ipv6-linux.args | 72 +++ .../nat-many-ips-linux.args | 72 +++ .../nat-no-dhcp-linux.args| 72 +++ .../nat-tftp-linux.args | 72 +++ .../route-default-linux.args | 72 +++ 7 files changed, 435 insertions(+) diff --git a/src/network/bridge_driver_linux.c b/src/network/bridge_driver_linux.c index fb09954b8f..6992653b4a 100644 --- a/src/network/bridge_driver_linux.c +++ b/src/network/bridge_driver_linux.c @@ -640,6 +640,9 @@ int networkAddFirewallRules(virNetworkDefPtr def) virFirewallPtr fw = NULL; int ret = -1; +if (iptablesSetupPrivateChains() < 0) +return -1; + fw = virFirewallNew(); virFirewallStartTransaction(fw, 0); diff --git a/tests/networkxml2firewalldata/nat-default-linux.args b/tests/networkxml2firewalldata/nat-default-linux.args index ffdafdff0e..9928da715b 100644 --- a/tests/networkxml2firewalldata/nat-default-linux.args +++ b/tests/networkxml2firewalldata/nat-default-linux.args @@ -1,5 +1,77 @@ iptables \ --table filter \ +--new-chain INP_libvirt +iptables \ +--table filter \ +--new-chain OUT_libvirt +iptables \ +--table filter \ +--new-chain FWD_libvirt_out +iptables \ +--table filter \ +--new-chain FWD_libvirt_in +iptables \ +--table filter \ +--new-chain FWD_libvirt_cross +iptables \ +--table nat \ +--new-chain PRT_libvirt +ip6tables \ +--table filter \ +--new-chain INP_libvirt +ip6tables \ +--table filter \ +--new-chain OUT_libvirt +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_out +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_in +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_cross +ip6tables \ +--table nat \ +--new-chain PRT_libvirt +iptables \ +--table filter \ +--list INPUT +iptables \ +--table filter \ +--list OUTPUT +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table nat \ +--list POSTROUTING +ip6tables \ +--table filter \ +--list INPUT +ip6tables \ +--table filter \ +--list OUTPUT +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table nat \ +--list POSTROUTING +iptables \ +--table filter \ --insert INPUT \ --in-interface virbr0 \ --protocol tcp \ diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.args b/tests/networkxml2firewalldata/nat-ipv6-linux.args index 22285afa10..440896de18 100644 --- a/tests/networkxml2firewalldata/nat-ipv6-linux.args +++ b/tests/networkxml2firewalldata/nat-ipv6-linux.args @@ -1,5 +1,77 @@ iptables \ --table filter \ +--new-chain INP_libvirt +iptables \ +--table filter \ +--new-chain OUT_libvirt +iptables \ +--table filter \ +--new-chain FWD_libvirt_out +iptables \ +--table filter \ +--new-chain FWD_libvirt_in +iptables \ +--table filter \ +--new-chain FWD_libvirt_cross +iptables \ +--table nat \ +--new-chain PRT_libvirt +ip6tables \ +--table filter \ +--new-chain INP_libvirt +ip6tables \ +--table filter \ +--new-chain OUT_libvirt +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_out +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_in +ip6tables \ +--table filter \ +--new-chain FWD_libvirt_cross +ip6tables \ +--table nat \ +--new-chain PRT_libvirt +iptables \ +--table filter \ +--list INPUT +iptables \ +--table filter \ +--list OUTPUT +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table filter \ +--list FORWARD +iptables \ +--table nat \ +--list POSTROUTING +ip6tables \ +--table filter \ +--list INPUT +ip6tables \ +--table filter \ +--list OUTPUT +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table filter \ +--list FORWARD +ip6tables \ +--table nat \ +--list POSTROUTING +iptables \ +--table filter \ --insert INPUT \ --in-interface virbr0 \ --protocol tcp \ diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.args b/tests/networkxml2firewalldata/nat-many-ips-linux.args index aff9f69664..d80a9551d4 100644 --- a/tests/networkxml2firewalldata/nat-many-ips-linux.args +++ b/tests/networkxml2firewalldata/nat-many-ips-linux.args @@ -1,5 +1,77 @@ iptables \ --table filter \ +--new-chain INP_libvirt +iptables \ +--table filter \ +--new-chain OUT_libvirt +iptables \ +--table filter \ +--new-chain FWD_libvirt_out +iptables \ +--table filter \ +--new-chain FWD_libvirt_in +iptables \ +--table filter \ +--new-chain FWD_libvirt_cross +iptables \ +--table nat \ +--new-chain PRT_libvirt +ip6tables \ +--
[libvirt] [PATCH 6/7] tests: remove duplicated test case in networkxml2firewalltest
Signed-off-by: Daniel P. Berrangé --- tests/networkxml2firewalltest.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tests/networkxml2firewalltest.c b/tests/networkxml2firewalltest.c index 242b645767..505ff0c740 100644 --- a/tests/networkxml2firewalltest.c +++ b/tests/networkxml2firewalltest.c @@ -154,7 +154,6 @@ mymain(void) DO_TEST("nat-no-dhcp"); DO_TEST("nat-ipv6"); DO_TEST("route-default"); -DO_TEST("route-default"); cleanup: return ret == 0 ? EXIT_SUCCESS : EXIT_FAILURE; -- 2.19.1 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH 5/7] util: switch over to creating rules in private chains
All rules are now created in the libvirt private firewall chains. The code for deleting rules will try to delete from both the original builtin chains and the new private chains in order to cleanup properly during upgrades. This finally fixes a very old bug (from 2008!) related to traffic between guests on distinct virtual networks. The intention is that networks never allow incoming connections, but the old ordering of rules meant that we would mistakenly allow accept traffic from whichever network was most recently created. With everything going into the FORWARD chain there was interleaving of rules for outbound traffic and inbound traffic for each network: ACCEPT all -- * virbr2 0.0.0.0/0192.168.123.0/24 ctstate RELATED,ESTABLISHED ACCEPT all -- virbr2 * 192.168.123.0/24 0.0.0.0/0 ACCEPT all -- virbr2 virbr2 0.0.0.0/00.0.0.0/0 REJECT all -- * virbr2 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable REJECT all -- virbr2 * 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable ACCEPT all -- * virbr0 0.0.0.0/0192.168.122.0/24 ctstate RELATED,ESTABLISHED ACCEPT all -- virbr0 * 192.168.122.0/24 0.0.0.0/0 ACCEPT all -- virbr0 virbr0 0.0.0.0/00.0.0.0/0 REJECT all -- * virbr0 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable REJECT all -- virbr0 * 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable So the rule allowing outbound traffic from virbr2 would mistakenly allow packets from virbr2 to virbr0, before the rule denying input to virbr0 gets a chance to run With the split up forwarding chains, all incoming deny rules are checked before any of the outgoing allow rules, as rules are grouped into three distinct sets Cross rules ACCEPT all -- virbr2 virbr2 0.0.0.0/00.0.0.0/0 ACCEPT all -- virbr0 virbr0 0.0.0.0/00.0.0.0/0 Incoming rules ACCEPT all -- * virbr2 0.0.0.0/0192.168.123.0/24 ctstate RELATED,ESTABLISHED ACCEPT all -- * virbr0 0.0.0.0/0192.168.122.0/24 ctstate RELATED,ESTABLISHED REJECT all -- * virbr2 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable REJECT all -- * virbr0 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable Outgoing rules ACCEPT all -- virbr2 * 192.168.123.0/24 0.0.0.0/0 REJECT all -- virbr2 * 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable ACCEPT all -- virbr0 * 192.168.122.0/24 0.0.0.0/0 REJECT all -- virbr0 * 0.0.0.0/00.0.0.0/0 reject-with icmp-port-unreachable Signed-off-by: Daniel P. Berrangé --- src/util/viriptables.c| 71 +-- .../nat-default-linux.args| 32 - .../nat-ipv6-linux.args | 48 ++--- .../nat-many-ips-linux.args | 60 .../nat-no-dhcp-linux.args| 46 ++-- .../nat-tftp-linux.args | 34 - .../route-default-linux.args | 22 +++--- 7 files changed, 171 insertions(+), 142 deletions(-) diff --git a/src/util/viriptables.c b/src/util/viriptables.c index b4a4bf9a12..ad029e6465 100644 --- a/src/util/viriptables.c +++ b/src/util/viriptables.c @@ -209,7 +209,7 @@ iptablesAddTcpInput(virFirewallPtr fw, const char *iface, int port) { -iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, ADD, 1); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_PRIVATE, iface, port, ADD, 1); } /** @@ -228,6 +228,7 @@ iptablesRemoveTcpInput(virFirewallPtr fw, int port) { iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 1); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_PRIVATE, iface, port, REMOVE, 1); } /** @@ -245,7 +246,7 @@ iptablesAddUdpInput(virFirewallPtr fw, const char *iface, int port) { -iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, ADD, 0); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_PRIVATE, iface, port, ADD, 0); } /** @@ -263,7 +264,8 @@ iptablesRemoveUdpInput(virFirewallPtr fw, const char *iface, int port) { -return iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 0); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 0); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_PRIVATE, iface, port, REMOVE, 0); } /** @@ -281,7 +283,7 @@ iptablesAddUdpOutput(virFirewallPtr fw,
[libvirt] [PATCH 2/7] util: add iptables API for creating base chains
Historically rules were added straight into the base chains. This works but it is inflexible for admins adding extra rules via hook scripts, and it is not clear which rules are libvirt created. There is a further complexity with the FORWARD chain where a specific ordering of rules is needed to ensure traffic is matched correctly. This would require complex interleaving of rules instead of plain appending. By splitting the FORWARD chain into three chains management will be simpler. Thus we create INPUT -> INP_libvirt OUTPUT -> OUT_libvirt FORWARD -> FWD_libvirt_cross FORWARD -> FWD_libvirt_in FORWARD -> FWD_libvirt_out POSTROUTING -> PRT_libvirt Signed-off-by: Daniel P. Berrangé --- src/libvirt_private.syms | 1 + src/util/viriptables.c | 81 src/util/viriptables.h | 2 + 3 files changed, 84 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 335210c31d..e42c946de6 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2062,6 +2062,7 @@ iptablesRemoveOutputFixUdpChecksum; iptablesRemoveTcpInput; iptablesRemoveUdpInput; iptablesRemoveUdpOutput; +iptablesSetupPrivateChains; # util/viriscsi.h diff --git a/src/util/viriptables.c b/src/util/viriptables.c index f379844d28..4a7ea54b38 100644 --- a/src/util/viriptables.c +++ b/src/util/viriptables.c @@ -51,6 +51,87 @@ enum { }; + +typedef struct { +virFirewallLayer layer; +const char *table; +const char *parent; +const char *child; +} iptablesChain; + +static int +iptablesCheckPrivateChain(virFirewallPtr fw, + const char *const *lines, + void *opaque) +{ +iptablesChain *data = opaque; +bool found = false; + +while (lines && *lines && !found) { +if (STRPREFIX(*lines, data->child)) +found = true; +lines++; +} + +if (!found) +virFirewallAddRule(fw, data->layer, + "--table", data->table, + "--insert", data->parent, + "--jump", data->child, NULL); + + return 0; +} + + +int +iptablesSetupPrivateChains(void) +{ +virFirewallPtr fw; +int ret = -1; +iptablesChain chains[] = { +{VIR_FIREWALL_LAYER_IPV4, "filter", "INPUT", "INP_libvirt"}, +{VIR_FIREWALL_LAYER_IPV4, "filter", "OUTPUT", "OUT_libvirt"}, +{VIR_FIREWALL_LAYER_IPV4, "filter", "FORWARD", "FWD_libvirt_out"}, +{VIR_FIREWALL_LAYER_IPV4, "filter", "FORWARD", "FWD_libvirt_in"}, +{VIR_FIREWALL_LAYER_IPV4, "filter", "FORWARD", "FWD_libvirt_cross"}, +{VIR_FIREWALL_LAYER_IPV4, "nat", "POSTROUTING", "PRT_libvirt"}, +{VIR_FIREWALL_LAYER_IPV6, "filter", "INPUT", "INP_libvirt"}, +{VIR_FIREWALL_LAYER_IPV6, "filter", "OUTPUT", "OUT_libvirt"}, +{VIR_FIREWALL_LAYER_IPV6, "filter", "FORWARD", "FWD_libvirt_out"}, +{VIR_FIREWALL_LAYER_IPV6, "filter", "FORWARD", "FWD_libvirt_in"}, +{VIR_FIREWALL_LAYER_IPV6, "filter", "FORWARD", "FWD_libvirt_cross"}, +{VIR_FIREWALL_LAYER_IPV6, "nat", "POSTROUTING", "PRT_libvirt"}, +}; +size_t i; + +fw = virFirewallNew(); + +virFirewallStartTransaction(fw, VIR_FIREWALL_TRANSACTION_IGNORE_ERRORS); + +for (i = 0; i < ARRAY_CARDINALITY(chains); i++) { +virFirewallAddRule(fw, chains[i].layer, + "--table", chains[i].table, + "--new-chain", chains[i].child, NULL); +} + +virFirewallStartTransaction(fw, 0); + +for (i = 0; i < ARRAY_CARDINALITY(chains); i++) { +virFirewallAddRuleFull(fw, chains[i].layer, + false, iptablesCheckPrivateChain, + &chains[i], + "--table", chains[i].table, + "--list", chains[i].parent, NULL); +} + +if (virFirewallApply(fw) < 0) +goto cleanup; + +ret = 0; + cleanup: +return ret; +} + static void iptablesInput(virFirewallPtr fw, virFirewallLayer layer, diff --git a/src/util/viriptables.h b/src/util/viriptables.h index 9ea25fc096..1db97937a1 100644 --- a/src/util/viriptables.h +++ b/src/util/viriptables.h @@ -27,6 +27,8 @@ # include "virsocketaddr.h" # include "virfirewall.h" +int iptablesSetupPrivateChains (void); + void iptablesAddTcpInput (virFirewallPtr fw, virFirewallLayer layer, const char *iface, -- 2.19.1 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH 3/7] util: prepare iptables for putting rules into private chains
Currently all rules are created directly in the INPUT, FORWARD, OUTPUT and POSTROUTING chains. This change prepares for putting the rules into private changes, but does not actually do the switch yet. Signed-off-by: Daniel P. Berrangé --- src/util/viriptables.c | 152 + 1 file changed, 108 insertions(+), 44 deletions(-) diff --git a/src/util/viriptables.c b/src/util/viriptables.c index 4a7ea54b38..b4a4bf9a12 100644 --- a/src/util/viriptables.c +++ b/src/util/viriptables.c @@ -50,6 +50,12 @@ enum { REMOVE }; +enum { +VIR_IPTABLES_CHAIN_BUILTIN, +VIR_IPTABLES_CHAIN_PRIVATE, + +VIR_IPTABLES_CHAIN_LAST, +}; typedef struct { @@ -135,19 +141,24 @@ iptablesSetupPrivateChains(void) static void iptablesInput(virFirewallPtr fw, virFirewallLayer layer, + int chain, const char *iface, int port, int action, int tcp) { char portstr[32]; +static const char *chainName[VIR_IPTABLES_CHAIN_LAST] = { +"INPUT", +"INP_libvirt", +}; snprintf(portstr, sizeof(portstr), "%d", port); portstr[sizeof(portstr) - 1] = '\0'; virFirewallAddRule(fw, layer, "--table", "filter", - action == ADD ? "--insert" : "--delete", "INPUT", + action == ADD ? "--insert" : "--delete", chainName[chain], "--in-interface", iface, "--protocol", tcp ? "tcp" : "udp", "--destination-port", portstr, @@ -158,19 +169,24 @@ iptablesInput(virFirewallPtr fw, static void iptablesOutput(virFirewallPtr fw, virFirewallLayer layer, + int chain, const char *iface, int port, int action, int tcp) { char portstr[32]; +static const char *chainName[VIR_IPTABLES_CHAIN_LAST] = { +"OUTPUT", +"OUT_libvirt", +}; snprintf(portstr, sizeof(portstr), "%d", port); portstr[sizeof(portstr) - 1] = '\0'; virFirewallAddRule(fw, layer, "--table", "filter", - action == ADD ? "--insert" : "--delete", "OUTPUT", + action == ADD ? "--insert" : "--delete", chainName[chain], "--out-interface", iface, "--protocol", tcp ? "tcp" : "udp", "--destination-port", portstr, @@ -193,7 +209,7 @@ iptablesAddTcpInput(virFirewallPtr fw, const char *iface, int port) { -iptablesInput(fw, layer, iface, port, ADD, 1); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, ADD, 1); } /** @@ -211,7 +227,7 @@ iptablesRemoveTcpInput(virFirewallPtr fw, const char *iface, int port) { -iptablesInput(fw, layer, iface, port, REMOVE, 1); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 1); } /** @@ -229,7 +245,7 @@ iptablesAddUdpInput(virFirewallPtr fw, const char *iface, int port) { -iptablesInput(fw, layer, iface, port, ADD, 0); +iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, ADD, 0); } /** @@ -247,7 +263,7 @@ iptablesRemoveUdpInput(virFirewallPtr fw, const char *iface, int port) { -return iptablesInput(fw, layer, iface, port, REMOVE, 0); +return iptablesInput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 0); } /** @@ -265,7 +281,7 @@ iptablesAddUdpOutput(virFirewallPtr fw, const char *iface, int port) { -iptablesOutput(fw, layer, iface, port, ADD, 0); +iptablesOutput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, ADD, 0); } /** @@ -283,7 +299,7 @@ iptablesRemoveUdpOutput(virFirewallPtr fw, const char *iface, int port) { -iptablesOutput(fw, layer, iface, port, REMOVE, 0); +iptablesOutput(fw, layer, VIR_IPTABLES_CHAIN_BUILTIN, iface, port, REMOVE, 0); } @@ -323,6 +339,7 @@ static char *iptablesFormatNetwork(virSocketAddr *netaddr, */ static int iptablesForwardAllowOut(virFirewallPtr fw, +int chain, virSocketAddr *netaddr, unsigned int prefix, const char *iface, @@ -332,6 +349,10 @@ iptablesForwardAllowOut(virFirewallPtr fw, VIR_AUTOFREE(char *) networkstr = NULL; virFirewallLayer layer = VIR_SOCKET_ADDR_FAMILY(netaddr) == AF_INET ? VIR_FIREWALL_LAYER_IPV4 : VIR_FIREWALL_LAYER_IPV6; +static const char *chainName[VIR_IPTABLES_CHAIN_LAST] = { +"FORWARD", +"FWD_libvirt_out", +}; if (!(networkstr = iptablesFormatNetwork(netaddr, prefix
[libvirt] [PATCH 1/7] util: refactor iptables APIs to share more code
Most of the iptables APIs share code for the add/delete paths, but a couple were separated. Merge the remaining APIs to facilitate future changes. Signed-off-by: Daniel P. Berrangé --- src/util/viriptables.c | 73 -- 1 file changed, 42 insertions(+), 31 deletions(-) diff --git a/src/util/viriptables.c b/src/util/viriptables.c index 5dbea8cf57..f379844d28 100644 --- a/src/util/viriptables.c +++ b/src/util/viriptables.c @@ -495,6 +495,21 @@ iptablesRemoveForwardAllowIn(virFirewallPtr fw, return iptablesForwardAllowIn(fw, netaddr, prefix, iface, physdev, REMOVE); } +static void +iptablesForwardAllowCross(virFirewallPtr fw, + virFirewallLayer layer, + const char *iface, + int action) +{ +virFirewallAddRule(fw, layer, + "--table", "filter", + action == ADD ? "--insert" : "--delete", "FORWARD", + "--in-interface", iface, + "--out-interface", iface, + "--jump", "ACCEPT", + NULL); +} + /** * iptablesAddForwardAllowCross: * @ctx: pointer to the IP table context @@ -511,13 +526,7 @@ iptablesAddForwardAllowCross(virFirewallPtr fw, virFirewallLayer layer, const char *iface) { -virFirewallAddRule(fw, layer, - "--table", "filter", - "--insert", "FORWARD", - "--in-interface", iface, - "--out-interface", iface, - "--jump", "ACCEPT", - NULL); +iptablesForwardAllowCross(fw, layer, iface, ADD); } /** @@ -535,13 +544,21 @@ void iptablesRemoveForwardAllowCross(virFirewallPtr fw, virFirewallLayer layer, const char *iface) +{ +iptablesForwardAllowCross(fw, layer, iface, REMOVE); +} + +static void +iptablesForwardRejectOut(virFirewallPtr fw, + virFirewallLayer layer, + const char *iface, + int action) { virFirewallAddRule(fw, layer, "--table", "filter", - "--delete", "FORWARD", + action == ADD ? "--insert" : "delete", "FORWARD", "--in-interface", iface, - "--out-interface", iface, - "--jump", "ACCEPT", + "--jump", "REJECT", NULL); } @@ -560,12 +577,7 @@ iptablesAddForwardRejectOut(virFirewallPtr fw, virFirewallLayer layer, const char *iface) { -virFirewallAddRule(fw, layer, - "--table", "filter", - "--insert", "FORWARD", - "--in-interface", iface, - "--jump", "REJECT", - NULL); +iptablesForwardRejectOut(fw, layer, iface, ADD); } /** @@ -582,16 +594,25 @@ void iptablesRemoveForwardRejectOut(virFirewallPtr fw, virFirewallLayer layer, const char *iface) +{ +iptablesForwardRejectOut(fw, layer, iface, REMOVE); +} + + +static void +iptablesForwardRejectIn(virFirewallPtr fw, +virFirewallLayer layer, +const char *iface, +int action) { virFirewallAddRule(fw, layer, "--table", "filter", - "--delete", "FORWARD", - "--in-interface", iface, + action == ADD ? "--insert" : "--delete", "FORWARD", + "--out-interface", iface, "--jump", "REJECT", NULL); } - /** * iptablesAddForwardRejectIn: * @ctx: pointer to the IP table context @@ -607,12 +628,7 @@ iptablesAddForwardRejectIn(virFirewallPtr fw, virFirewallLayer layer, const char *iface) { -virFirewallAddRule(fw, layer, - "--table", "filter", - "--insert", "FORWARD", - "--out-interface", iface, - "--jump", "REJECT", - NULL); +iptablesForwardRejectIn(fw, layer, iface, ADD); } /** @@ -630,12 +646,7 @@ iptablesRemoveForwardRejectIn(virFirewallPtr fw, virFirewallLayer layer, const char *iface) { -virFirewallAddRule(fw, layer, - "--table", "filter", - "--delete", "FORWARD", - "--out-interface", iface, - "--jump", "REJECT", - NULL); +iptablesForwardRejectIn(fw, layer, iface
[libvirt] [PATCH 0/7] Restructure firewall rules for virtual networks into private chains
The virtual networks in NAT mode are supposed to only allow outbound network access for guests. Unfortunately due to ordering of the firewall rules libvirt creates, when you have multiple virtual networks, guests on the more recently created virtual networks can connect to guests on old virtual networks. This was reported way back in 2008 but we always thought the fix would be very complicated to deal with, so we've been putting it off forever. In parallel with this there's also been a long standing desire since 2009 to move our firewall rules out of the builtin chains, to libvirt private chains. This is to make it easier for admins to use hook scripts to setup rules in the builtin chains that take priority over rules libvirt creates. In implementing the changes to use private chains, I suddenly realized that fixing the network to network traffic blocking problem was trivial if I grouped the forwarding rules into three distinct sets. So this series finally fixes an annoying 10 year old bug, and implements a 9 year old RFE. It may take us a while, but we'll get to your bugs eventually ;-) Daniel P. Berrangé (7): util: refactor iptables APIs to share more code util: add iptables API for creating base chains util: prepare iptables for putting rules into private chains network: setup default iptables chains util: switch over to creating rules in private chains tests: remove duplicated test case in networkxml2firewalltest tests: fix dry run handling in network firewall test src/libvirt_private.syms | 1 + src/network/bridge_driver_linux.c | 3 + src/util/viriptables.c| 317 ++ src/util/viriptables.h| 2 + .../nat-default-linux.args| 150 - .../nat-ipv6-linux.args | 166 +++-- .../nat-many-ips-linux.args | 178 -- .../nat-no-dhcp-linux.args| 164 +++-- .../nat-tftp-linux.args | 152 - .../route-default-linux.args | 140 +++- tests/networkxml2firewalltest.c | 17 +- 11 files changed, 1107 insertions(+), 183 deletions(-) -- 2.19.1 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH 1/3] qemu: caps: add drive.qcow2.l2-cache-size
Signed-off-by: Nikolay Shirokovskiy --- src/qemu/qemu_capabilities.c | 3 +++ src/qemu/qemu_capabilities.h | 1 + tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.11.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.5.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.6.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.6.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_2.6.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.7.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.7.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.8.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.8.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.9.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_2.9.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.9.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml | 1 + 28 files changed, 30 insertions(+) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index e228f52..7d42254 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -508,6 +508,7 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, /* 315 */ "vfio-pci.display", "blockdev", + "drive.qcow2.l2-cache-size", ); @@ -1234,6 +1235,8 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = { { "blockdev-add/arg-type/+vxhs", QEMU_CAPS_VXHS}, { "blockdev-add/arg-type/+iscsi/password-secret", QEMU_CAPS_ISCSI_PASSWORD_SECRET }, { "blockdev-add/arg-type/+qcow2/encrypt/+luks/key-secret", QEMU_CAPS_QCOW2_LUKS }, +{ "blockdev-add/arg-type/options/+qcow2/l2-cache-size", QEMU_CAPS_DRIVE_QCOW2_L2_CACHE_SIZE}, +{ "blockdev-add/arg-type/+qcow2/l2-cache-size", QEMU_CAPS_DRIVE_QCOW2_L2_CACHE_SIZE}, { "nbd-server-start/arg-type/tls-creds", QEMU_CAPS_NBD_TLS }, { "screendump/arg-type/device", QEMU_CAPS_SCREENDUMP_DEVICE }, { "block-commit/arg-type/*top", QEMU_CAPS_ACTIVE_COMMIT }, diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 934620e..eb9b98b 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -492,6 +492,7 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */ /* 315 */ QEMU_CAPS_VFIO_PCI_DISPLAY, /* -device vfio-pci.display */ QEMU_CAPS_BLOCKDEV, /* -blockdev and blockdev-add are supported */ +QEMU_CAPS_DRIVE_QCOW2_L2_CACHE_SIZE, /* -drive l2-cache-size */ QEMU_CAPS_LAST /* this must always be the last item */ } virQEMUCapsFlags; diff --git a/tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml index b9c4182..1214a48 100644 --- a/tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml +++ b/tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml @@ -151,6 +151,7 @@ + 201 0 305067 diff --git a/tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml index 66b2560..defbd6d 100644 --- a/tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml @@ -150,6 +150,7 @@ + 201 0 384412 diff --git a/tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml b/tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml index e000aac..3dfb5b3 100644 --- a/tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml @@ -113,6 +113,7 @@ + 201 0 306247 diff --git a/tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml index ebc5e77..adf9cd0 100644 --- a/tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml @@ -192,6 +192,7 @@ + 201 0 364386 diff --git a/tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml b/tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml index 4eb8a39..d115424 100644 --- a/tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml @@ -120,6 +120,7 @@ + 2011000 0 345099 diff --git a/tests/qemucapabilities
[libvirt] [PATCH 2/3] xml: add disk driver metadata_cache_size option
The only possible value is 'maximum' which makes l2_cache_size large enough to keep all metadata in memory. This will unleash disks performace for some database workloads and IO benchmarks with random access to whole disk. Note that imlementation sets l2-cache-size and not cache-size. Both *cache-size's is upper limit on cache size value thus instead of setting precise limit for disk which involves knowing disk size and disk's cluster size we can just set INT64_MAX. Unfortunately both old and new versions of qemu fail on setting cache-size to INT64_MAX. Fortunately both old and new versions works well on such setting for l2-cache-size. As guest performance depends only l2 cache size and not refcount cache size (which is documented in recent qemu) we can set l2 directly. Signed-off-by: Nikolay Shirokovskiy --- docs/formatdomain.html.in | 7 docs/schemas/domaincommon.rng | 10 + src/conf/domain_conf.c | 17 src/conf/domain_conf.h | 9 src/qemu/qemu_command.c| 26 src/qemu/qemu_domain.c | 1 + .../qemuxml2argvdata/disk-metadata_cache_size.args | 34 +++ .../qemuxml2argvdata/disk-metadata_cache_size.xml | 42 +++ tests/qemuxml2argvtest.c | 2 + .../disk-metadata_cache_size.xml | 48 ++ tests/qemuxml2xmltest.c| 2 + 11 files changed, 198 insertions(+) create mode 100644 tests/qemuxml2argvdata/disk-metadata_cache_size.args create mode 100644 tests/qemuxml2argvdata/disk-metadata_cache_size.xml create mode 100644 tests/qemuxml2xmloutdata/disk-metadata_cache_size.xml diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8189959..93e0009 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -3556,6 +3556,13 @@ virt queues for virtio-blk. (Since 3.9.0) +The optional metadata_cache_size attribute specifies +metadata cache size policy. The only possible value is "maximum" to +keep all metadata in cache, this will help if workload needs access +to whole disk all the time. (Since +4.9.0) + + For virtio disks, Virtio-specific options can also be set. (Since 3.5.0) diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 099a949..18efa3a 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -1990,6 +1990,9 @@ + + + @@ -2090,6 +2093,13 @@ + + + +maximum + + + diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index e8e0adc..04383f0 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -885,6 +885,11 @@ VIR_ENUM_IMPL(virDomainDiskDetectZeroes, VIR_DOMAIN_DISK_DETECT_ZEROES_LAST, "on", "unmap") +VIR_ENUM_IMPL(virDomainDiskMetadataCacheSize, + VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_LAST, + "default", + "maximum") + VIR_ENUM_IMPL(virDomainDiskMirrorState, VIR_DOMAIN_DISK_MIRROR_STATE_LAST, "none", "yes", @@ -9347,6 +9352,14 @@ virDomainDiskDefDriverParseXML(virDomainDiskDefPtr def, } VIR_FREE(tmp); +if ((tmp = virXMLPropString(cur, "metadata_cache_size")) && +(def->metadata_cache_size = virDomainDiskMetadataCacheSizeTypeFromString(tmp)) <= 0) { +virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("unknown driver metadata_cache_size value '%s'"), tmp); +goto cleanup; +} +VIR_FREE(tmp); + ret = 0; cleanup: @@ -23874,6 +23887,10 @@ virDomainDiskDefFormatDriver(virBufferPtr buf, if (disk->queues) virBufferAsprintf(&driverBuf, " queues='%u'", disk->queues); +if (disk->metadata_cache_size) +virBufferAsprintf(&driverBuf, " metadata_cache_size='%s'", + virDomainDiskMetadataCacheSizeTypeToString(disk->metadata_cache_size)); + virDomainVirtioOptionsFormat(&driverBuf, disk->virtio); ret = virXMLFormatElement(buf, "driver", &driverBuf, NULL); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index e30a4b2..b155058 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -568,6 +568,13 @@ typedef enum { VIR_DOMAIN_DISK_DETECT_ZEROES_LAST } virDomainDiskDetectZeroes; +typedef enum { +VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_DEFAULT = 0, +VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_MAXIMUM, + +VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_LAST +} virDomainDiskMetadataCacheSize; + typedef struct _virDomainBlockIoTuneInfo virDomainBlockIoTuneInfo; struct _virD
Re: [libvirt] [PATCH] nodedev: Document the udevEventHandleThread
On Fri, Oct 19, 2018 at 08:04:43AM -0400, John Ferlan wrote: > Add details of the algorithm and why it was used to help future > readers understand the issues encountered. Based largely off a > description provided by Erik Skultety . Since the link to the archive would be missing from the commit message, I feel like simply blaming commit cdbe13329a7 for lacking documentation/justification of the algorithm used is the right way of composing the commit message. I appreciate the effort to document it properly, although I feel it's a bit overwhelming, so the comments I have below is just a sheer attempt to reduce the amount of text documenting a single function, you know, if a function needs a comment like that, it suggests that there's something wrong with that function... > > Signed-off-by: John Ferlan > --- > > see: https://www.redhat.com/archives/libvir-list/2018-October/msg00915.html > > src/node_device/node_device_udev.c | 49 ++ > 1 file changed, 49 insertions(+) > > diff --git a/src/node_device/node_device_udev.c > b/src/node_device/node_device_udev.c > index 22897591de..8ca81e65ad 100644 > --- a/src/node_device/node_device_udev.c > +++ b/src/node_device/node_device_udev.c > @@ -1591,6 +1591,53 @@ udevEventMonitorSanityCheck(udevEventDataPtr priv, > } > > > +/** > + * udevEventHandleThread > + * @opaque: unused > + * > + * Thread to handle the udevEventHandleCallback processing when udev > + * tells us there's a device change for us (add, modify, delete, etc). > + * > + * Management of the processing udev device notification is a delicate > + * balance between the udev process, the scheduler, and when exactly the > + * data from/for the socket is received. I'd drop ^this sentence, the issue with this unfortunate mix of factors is explained further down, let's try keeping the amount of commentary down. > + * The udev processing code notifies > + * the udevEventHandleCallback that it has data; however, the actual recv() > + * done in the udev code to fill in the @device structure can be delayed > + * due to how threads are scheduled. How about: "When we get notified by udev that there are data to be processed, the actual data retrieval by libudev may be delayed due to how threads are scheduled." > + * > + * As it turns out, the event loop could be scheduled (and it in fact > + * was) earlier than the handler thread. What that essentially means is > + * that by the time the thread actually handled the event and read the > + * data from the monitor, the event loop fired the very same event, simply > + * because the data hadn't been retrieved from the socket at that point yet. ^This doesn't need to be a separate paragraph, you can connect it to the previous one. Also, some words don't feel right for commentary purposes (I know they're mine :D). How about: "In fact, the event loop could be scheduled earlier than the handler thread, thus potentially emitting the very same event the handler thread is currently trying to process, simply because the data hadn't been retrieved from the socket at that time yet." > + * > + * Thus this algorithm is that once udevEventHandleCallback is notified > + * there is data, ^this can be dropped... > + * this loop will process as much data as possible until > + * udev_monitor_receive_device fails to get the @device. This may be > + * because we've processed everything or because the scheduling thread > + * hasn't been able to populate data in the socket. Once we're done > + * processing everything we can, wait on the next event/notification. How about: "Instead of waiting for the next event after retrieving a device data we keep reading from the udev monitor until we encounter one of EAGAIN or EWOULDBLOCK errors, only then we'll wait for the next event." Also, I'd move that bit into the function body, so it's directly tied to the piece of code responsible for it. > + * Although this may incur a slight delay if the socket data wasn't > + * populated, the event/notifications are fairly consistent so there's > + * no need to use virCondWaitUntil. ^This could be dropped. > + * > + * NB: Usage of event based socket algorithm causes some issues with > + * older platforms such as CentOS 6 in which the data socket is opened > + * without the NONBLOCK bit set. Still even if the reset of priv->dataReady > + * were moved to just after the udev_monitor_receive_device in order to > + * avoid the NONBLOCK issue, the scheduler would still come into play "Trying to move the reset of the @priv->dataReady flag within the loop wouldn't help much as the scheduler would still come into play." Everything written below could go to the commit message itself. Would you agree to the proposed adjustments? Erik > + * especially for environments when multiple devices are added into the > + * system. Even those environments would still be blocked on the udev > + * socket recv() call. The algorithm for handling that scenario would > + * be more complex
[libvirt] [PATCH 0/3] add disk driver metadata_cache_size option
Hi, all. This is a patch series after offlist agreement on introducing metadata-cache-size option for disks. The options itself is described in 2nd patch of the series. There is a plenty of attempts to add option to set qcow2 metadata cache sizes in past, see [1] [2] to name recent that has comments I tried to address or has work I reused to some extent. I would like to address Peter's comment in [1] that this option is image's option rather then disk's here. While this makes sense if we set specific cache value that covers disk only partially, here when we have maximum policy that covers whole disk it makes sense to set same value for all images. The setted value is only upper limit on cache size and thus cache for every image will grow proportionally to image data size "visible from top" eventually I guess. And these sizes are changing during guest lifetime - thus we need set whole disk limit for every image for 'maxium' effect. On the other hand if we add policies different from 'maximum' it may require per image option. Well I can't imagine name for element for backing chain that will have cache size attribute better then 'driver'). And driver is already used for top image so I leave it this way. KNOWN ISSUES 1. when using -driver to configure disks in command line cache size does not get propagated thru backing chain 2. when making external disk snapshot cache size sneak into backing file filename and thus cache size for backing image became remembered there 3. blockcommit after such snapshot will not work because libvirt is not ready for backing file name is reported as json sometimes Looks like 1 can be addressed in qemu. The reason for behaviour in 2 I do not understand honestly. And 3 will be naturally addessed after blockjobs starts working with blockdev configuration I guess. LINKS [1] [PATCH] qemu: Added support L2 table cache for qcow2 disk https://www.redhat.com/archives/libvir-list/2018-July/msg00166.html [2] [PATCH v6 0/2] Add support for qcow2 cache https://www.redhat.com/archives/libvir-list/2017-September/msg00553.html Nikolay Shirokovskiy (3): qemu: caps: add drive.qcow2.l2-cache-size xml: add disk driver metadata_cache_size option qemu: support metadata-cache-size for blockdev docs/formatdomain.html.in | 7 docs/schemas/domaincommon.rng | 10 + src/conf/domain_conf.c | 17 src/conf/domain_conf.h | 9 src/qemu/qemu_block.c | 5 ++- src/qemu/qemu_capabilities.c | 3 ++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c| 26 src/qemu/qemu_domain.c | 2 + src/util/virstoragefile.c | 1 + src/util/virstoragefile.h | 1 + tests/qemucapabilitiesdata/caps_2.10.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.10.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.11.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.11.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.5.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.6.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.6.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_2.6.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.7.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.7.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.8.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.8.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_2.9.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_2.9.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_2.9.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml| 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml| 1 + tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml | 1 + .../qemuxml2argvdata/disk-metadata_cache_size.args | 34 +++ .../qemuxml2argvdata/disk-metadata_cache_size.xml | 42 +++ tests/qemuxml2argvtest.c | 2 + .../disk-metadata_cache_size.xml | 48 ++ tests/qemuxml2xmltest.c| 2 + 42 files changed, 235 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-metadata_cache_size.args create mode 100644 tests/qemuxml2argvdat
[libvirt] [PATCH 3/3] qemu: support metadata-cache-size for blockdev
Just set l2-cache-size to INT64_MAX for all format nodes of qcow2 type in block node graph. AFAIK this is sane because *actual* cache size depends on size of data being referenced in image and thus the total size of all cache sizes for all images in disk backing chain will not exceed the cache size that covers just one full image as in case of no backing chain. Signed-off-by: Nikolay Shirokovskiy --- src/qemu/qemu_block.c | 5 - src/qemu/qemu_domain.c| 1 + src/util/virstoragefile.c | 1 + src/util/virstoragefile.h | 1 + 4 files changed, 7 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c index 5321dda..8771cc1 100644 --- a/src/qemu/qemu_block.c +++ b/src/qemu/qemu_block.c @@ -1322,7 +1322,6 @@ qemuBlockStorageSourceGetFormatQcow2Props(virStorageSourcePtr src, * 'pass-discard-snapshot' * 'pass-discard-other' * 'overlap-check' - * 'l2-cache-size' * 'l2-cache-entry-size' * 'refcount-cache-size' * 'cache-clean-interval' @@ -1331,6 +1330,10 @@ qemuBlockStorageSourceGetFormatQcow2Props(virStorageSourcePtr src, if (qemuBlockStorageSourceGetFormatQcowGenericProps(src, "qcow2", props) < 0) return -1; +if (src->metadata_cache_size == VIR_DOMAIN_DISK_METADATA_CACHE_SIZE_MAXIMUM && +virJSONValueObjectAdd(props, "I:l2-cache-size", INT64_MAX, NULL) < 0) +return -1; + return 0; } diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 896adf3..f87cfd2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -13245,6 +13245,7 @@ qemuDomainPrepareDiskSourceData(virDomainDiskDefPtr disk, src->iomode = disk->iomode; src->cachemode = disk->cachemode; src->discard = disk->discard; +src->metadata_cache_size = disk->metadata_cache_size; if (disk->device == VIR_DOMAIN_DISK_DEVICE_FLOPPY) src->floppyimg = true; diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 94c32d8..9089e2f 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -2210,6 +2210,7 @@ virStorageSourceCopy(const virStorageSource *src, ret->cachemode = src->cachemode; ret->discard = src->discard; ret->detect_zeroes = src->detect_zeroes; +ret->metadata_cache_size = src->metadata_cache_size; /* storage driver metadata are not copied */ ret->drv = NULL; diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index 3ff6c4f..8b57399 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h @@ -331,6 +331,7 @@ struct _virStorageSource { int cachemode; /* enum virDomainDiskCache */ int discard; /* enum virDomainDiskDiscard */ int detect_zeroes; /* enum virDomainDiskDetectZeroes */ +int metadata_cache_size; /* enum virDomainDiskImageMetadataCacheSize */ bool floppyimg; /* set to true if the storage source is going to be used as a source for floppy drive */ -- 1.8.3.1 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH libvirt-python] event-test.py: Report ERROR events
VIR_DOMAIN_EVENT_ID_IO_ERROR and VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON callbacks receive the same 'action' parameter, so also translate that numeric action to a descriptive text for the first callback. Signed-off-by: Philipp Hahn --- examples/event-test.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/event-test.py b/examples/event-test.py index dabf4b0..709277b 100755 --- a/examples/event-test.py +++ b/examples/event-test.py @@ -526,8 +526,8 @@ def myDomainEventWatchdogCallback(conn, dom, action, opaque): def myDomainEventIOErrorCallback(conn, dom, srcpath, devalias, action, opaque): -print("myDomainEventIOErrorCallback: Domain %s(%s) %s %s %d" % ( -dom.name(), dom.ID(), srcpath, devalias, action)) +print("myDomainEventIOErrorCallback: Domain %s(%s) %s %s %s" % ( +dom.name(), dom.ID(), srcpath, devalias, ERROR_EVENTS[action])) def myDomainEventIOErrorReasonCallback(conn, dom, srcpath, devalias, action, reason, opaque): -- 2.11.0 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH libvirt-python] event-test.py: Fix ERROR event
ERROR_EVENTS translates the numeric 'action' argument to a description, not the 'reason' argument which already contains a descriptive string like 'enospc'. > Traceback (most recent call last): > File "/usr/lib/python2.7/dist-packages/libvirt.py", line 4661, in > _dispatchDomainEventIOErrorReasonCallback > reason, opaque) > File "libvirt-python/examples/event-test.py", line 536, in > myDomainEventIOErrorReasonCallback > dom.name(), dom.ID(), srcpath, devalias, action, ERROR_EVENTS[reason])) > File "libvirt-python/examples/event-test.py", line 474, in __getitem__ > data = self.args[item] > TypeError: tuple indices must be integers, not str Fixes: f5928c6711654f1496707ca77f626b3192843d57 Signed-off-by: Philipp Hahn --- examples/event-test.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/event-test.py b/examples/event-test.py index eeb2777..dabf4b0 100755 --- a/examples/event-test.py +++ b/examples/event-test.py @@ -531,8 +531,8 @@ def myDomainEventIOErrorCallback(conn, dom, srcpath, devalias, action, opaque): def myDomainEventIOErrorReasonCallback(conn, dom, srcpath, devalias, action, reason, opaque): -print("myDomainEventIOErrorReasonCallback: Domain %s(%s) %s %s %d %s" % ( -dom.name(), dom.ID(), srcpath, devalias, action, ERROR_EVENTS[reason])) +print("myDomainEventIOErrorReasonCallback: Domain %s(%s) %s %s %s %s" % ( +dom.name(), dom.ID(), srcpath, devalias, ERROR_EVENTS[action], reason)) def myDomainEventGraphicsCallback(conn, dom, phase, localAddr, remoteAddr, authScheme, subject, opaque): -- 2.11.0 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Thu, Nov 01, 2018 at 09:07:05AM +0100, Michal Privoznik wrote: > On 11/01/2018 05:11 AM, Ján Tomko wrote: > > IIRC the problem was that users with vmx disabled in BIOS setup needed > > to delete libvirt's qemuCaps cache manually after enabling it even after > > restarting the system. > > Sounds like it. The commit that introduced the check is d87df9bd39b. > > Honestly, I don't think we need to care about perms - we can assume > those are set properly (as they don't change often). What we have to We can't assume permissions are correct. One of the core issues we hit was that the udev rule which sets the permissions has been in the QEMU RPM historically. So you can have the kernel module loaded automatically by systemd when it detects VMX, but its permissions will be 0600 until the qemu-kvm RPM is installed and udev is triggered to fix the permissions to 0666. > care about is inserting/removing kvm module (even though it's not > necessary these days since virtualbox has learned how to co-exist with > KVM). Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Thu, Nov 01, 2018 at 09:31:50AM +0100, Martin Kletzander wrote: > On Tue, Oct 30, 2018 at 03:07:59PM +, Daniel P. Berrangé wrote: > > On Tue, Oct 30, 2018 at 03:45:36PM +0100, Michal Privoznik wrote: > > > On 10/30/2018 02:46 PM, Michal Privoznik wrote: > > > > > > > > It is kernel problem. In my testing, the moment I call: > > > > > > > > ioctl(kvm, KVM_CREATE_VM, 0); > > > > > > Okay, I have to retract this claim. 'udevadm monitor' shows some events: > > > > > > KERNEL[3631.129645] change /devices/virtual/misc/kvm (misc) > > > UDEV [3631.130816] change /devices/virtual/misc/kvm (misc) > > > > > > and stopping udevd leaves all three times untouched. So it is udev after > > > all. I just don't know how to find the rule that is causing the issue. > > > Anyway, as for this patch, I think we can merge it in the end, can't we? > > > > Not really. Udev is in use everywhere, so this behaviour makes the > > patch useless in practice, even though it is technically right in > > theory :-( > > > > Does it? With this behaviour we still do the "expensive" work after any > machine > has started. But for one machine starting it still has the effect of running > it > only once. And we *need* to run it for each machine unless we also cache the > result per (at least) user:group of that machine as every machine can run > under > different user:group. Yes, with the current patch it still rechecks it for each VM startup, but I was going to suggest the caching of this is simply put in a global static variable as there's no reason to record this device permissions state in the per VM caps cache. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] qemu: Report warning if QoS is set for vhostuser interface
On Wed, Oct 31, 2018 at 04:47:45PM +0100, Michal Privoznik wrote: > https://bugzilla.redhat.com/show_bug.cgi?id=1524230 > > Because of historical reasons, we are not denying starting a > domain which has QoS set for unsupported type of device. We do > report just a warning instead. And even though we perhaps used to > do so for vhostuser it got lost somewhere. Bring it back. So, if my blame-fu isn't flawed, then I don't think that vhostuser ever reported a warning during machine startup, have a look at commits 4a74ccdb92f and 0bce012d7f0 respectively - there always was a goto cleanup statement. > > Signed-off-by: Michal Privoznik > --- > src/qemu/qemu_command.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c > index 6e3ff67660..489e8bc689 100644 > --- a/src/qemu/qemu_command.c > +++ b/src/qemu/qemu_command.c > @@ -8246,6 +8246,12 @@ qemuBuildVhostuserCommandLine(virQEMUDriverPtr driver, > goto cleanup; > } > > +if (virDomainNetGetActualBandwidth(net)) { > +VIR_WARN("setting bandwidth on interfaces of " > + "type '%s' is not implemented yet", > + virDomainNetTypeToString(VIR_DOMAIN_NET_TYPE_VHOSTUSER)); > +} > + We already do this kind of checking on line 8593 in the same source file. It's just because of the goto cleanup jump after calling qemuBuildVhostuserCommandLine that causes the logic to never reach the condition (Not to mention the code is a mess). I'd suggest moving the already existing check to the top of the qemuBuildInterfaceCommandLine function instead of duplicating the code. Erik -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Mon, Oct 29, 2018 at 06:34:58PM +0100, Marc Hartmayer wrote: Introduce caching whether /dev/kvm is usable as the QEMU user:QEMU group. This reduces the overhead of the QEMU capabilities cache lookup. Before this patch there were many fork() calls used for checking whether /dev/kvm is accessible. Now we store the result whether /dev/kvm is accessible or not and we only need to re-run the virFileAccessibleAs check if the ctime of /dev/kvm has changed. Suggested-by: Daniel P. Berrangé Signed-off-by: Marc Hartmayer --- src/qemu/qemu_capabilities.c | 54 ++-- 1 file changed, 52 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index e228f52ec0bb..85516954149b 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -3238,6 +3238,10 @@ struct _virQEMUCapsCachePriv { virArch hostArch; unsigned int microcodeVersion; char *kernelVersion; + +/* cache whether /dev/kvm is usable as runUid:runGuid */ Even though this doesn't solve all of it (the ctime change with udev etc.) it makes it better. However this needs to be checked (or cached) per seclabel of the domain as it can run under non-default user. +virTristateBool kvmUsable; +time_t kvmCtime; }; typedef struct _virQEMUCapsCachePriv virQEMUCapsCachePriv; typedef virQEMUCapsCachePriv *virQEMUCapsCachePrivPtr; @@ -3824,6 +3828,52 @@ virQEMUCapsSaveFile(void *data, } +/* Determine whether '/dev/kvm' is usable as QEMU user:QEMU group. */ +static bool +virQEMUCapsKVMUsable(virQEMUCapsCachePrivPtr priv) +{ +struct stat sb; +static const char *kvm_device = "/dev/kvm"; +virTristateBool value; +virTristateBool cached_value = priv->kvmUsable; +time_t kvm_ctime; +time_t cached_kvm_ctime = priv->kvmCtime; + +if (stat(kvm_device, &sb) < 0) { +virReportSystemError(errno, + _("Failed to stat %s"), kvm_device); +return false; +} +kvm_ctime = sb.st_ctime; + +if (kvm_ctime != cached_kvm_ctime) { +VIR_DEBUG("%s has changed (%lld vs %lld)", kvm_device, + (long long)kvm_ctime, (long long)cached_kvm_ctime); +cached_value = VIR_TRISTATE_BOOL_ABSENT; +} + +if (cached_value != VIR_TRISTATE_BOOL_ABSENT) +return cached_value == VIR_TRISTATE_BOOL_YES; + +if (virFileAccessibleAs(kvm_device, R_OK | W_OK, +priv->runUid, priv->runGid) == 0) { +value = VIR_TRISTATE_BOOL_YES; +} else { +value = VIR_TRISTATE_BOOL_NO; +} + +/* There is a race window between 'stat' and + * 'virFileAccessibleAs'. However, since we're only interested in + * detecting changes *after* the virFileAccessibleAs check, we can + * neglect this here. + */ +priv->kvmCtime = kvm_ctime; +priv->kvmUsable = value; + +return value == VIR_TRISTATE_BOOL_YES; +} + + static bool virQEMUCapsIsValid(void *data, void *privData) @@ -3872,8 +3922,7 @@ virQEMUCapsIsValid(void *data, return true; } -kvmUsable = virFileAccessibleAs("/dev/kvm", R_OK | W_OK, -priv->runUid, priv->runGid) == 0; +kvmUsable = virQEMUCapsKVMUsable(priv); if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_KVM) && kvmUsable) { @@ -4684,6 +4733,7 @@ virQEMUCapsCacheNew(const char *libDir, priv->runUid = runUid; priv->runGid = runGid; priv->microcodeVersion = microcodeVersion; +priv->kvmUsable = VIR_TRISTATE_BOOL_ABSENT; if (uname(&uts) == 0 && virAsprintf(&priv->kernelVersion, "%s %s", uts.release, uts.version) < 0) -- 2.17.0 -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list signature.asc Description: Digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Thu, Nov 01, 2018 at 09:31:50AM +0100, Martin Kletzander wrote: On Tue, Oct 30, 2018 at 03:07:59PM +, Daniel P. Berrangé wrote: On Tue, Oct 30, 2018 at 03:45:36PM +0100, Michal Privoznik wrote: On 10/30/2018 02:46 PM, Michal Privoznik wrote: > On 10/30/2018 01:55 PM, Daniel P. Berrangé wrote: >> On Tue, Oct 30, 2018 at 10:32:08AM +, Daniel P. Berrangé wrote: >>> On Tue, Oct 30, 2018 at 11:08:45AM +0100, Michal Privoznik wrote: On 10/30/2018 10:35 AM, Daniel P. Berrangé wrote: > On Tue, Oct 30, 2018 at 09:13:50AM +0100, Michal Privoznik wrote: >> On 10/29/2018 06:34 PM, Marc Hartmayer wrote: >>> Introduce caching whether /dev/kvm is usable as the QEMU user:QEMU >>> group. This reduces the overhead of the QEMU capabilities cache >>> lookup. Before this patch there were many fork() calls used for >>> checking whether /dev/kvm is accessible. Now we store the result >>> whether /dev/kvm is accessible or not and we only need to re-run the >>> virFileAccessibleAs check if the ctime of /dev/kvm has changed. >>> >>> Suggested-by: Daniel P. Berrangé >>> Signed-off-by: Marc Hartmayer >>> --- >>> src/qemu/qemu_capabilities.c | 54 ++-- >>> 1 file changed, 52 insertions(+), 2 deletions(-) >>> >>> diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c >>> index e228f52ec0bb..85516954149b 100644 >>> --- a/src/qemu/qemu_capabilities.c >>> +++ b/src/qemu/qemu_capabilities.c >>> @@ -3238,6 +3238,10 @@ struct _virQEMUCapsCachePriv { >>> virArch hostArch; >>> unsigned int microcodeVersion; >>> char *kernelVersion; >>> + >>> +/* cache whether /dev/kvm is usable as runUid:runGuid */ >>> +virTristateBool kvmUsable; >>> +time_t kvmCtime; >>> }; >>> typedef struct _virQEMUCapsCachePriv virQEMUCapsCachePriv; >>> typedef virQEMUCapsCachePriv *virQEMUCapsCachePrivPtr; >>> @@ -3824,6 +3828,52 @@ virQEMUCapsSaveFile(void *data, >>> } >>> >>> >>> +/* Determine whether '/dev/kvm' is usable as QEMU user:QEMU group. */ >>> +static bool >>> +virQEMUCapsKVMUsable(virQEMUCapsCachePrivPtr priv) >>> +{ >>> +struct stat sb; >>> +static const char *kvm_device = "/dev/kvm"; >>> +virTristateBool value; >>> +virTristateBool cached_value = priv->kvmUsable; >>> +time_t kvm_ctime; >>> +time_t cached_kvm_ctime = priv->kvmCtime; >>> + >>> +if (stat(kvm_device, &sb) < 0) { >>> +virReportSystemError(errno, >>> + _("Failed to stat %s"), kvm_device); >>> +return false; >>> +} >>> +kvm_ctime = sb.st_ctime; >> >> This doesn't feel right. /dev/kvm ctime is changed every time qemu is >> started or powered off (try running stat over it before and after a >> domain is started/shut off). So effectively we will fork more often than >> we would think. Should we cache inode number instead? Because for all >> that we care is simply if the file is there. > > Urgh, that is a bit strange and not the usual semantics for timestamps :-( Indeed. > > We can't stat the inode - the code was explicitly trying to cope with the > way /dev/kvm can change permissions when udev rules get applied. We would > have to compare the user, group, permissions mask and even ACL, or a hash > of those. Well, we can use ctime as suggested and post a patch for kernel to fix ctime behaviour. Until the patch is merged our behaviour would be suboptimal, but still better than it is now. >>> >>> I guess lets talk to KVM team for their input on this and then decide >>> what todo. >> >> Hmm, I wonder if it is not actually a kernel problem, but rather something >> in userspace genuinely touching the device in a way that caues these >> timestamps to be updated. >> > > It is kernel problem. In my testing, the moment I call: > > ioctl(kvm, KVM_CREATE_VM, 0); Okay, I have to retract this claim. 'udevadm monitor' shows some events: KERNEL[3631.129645] change /devices/virtual/misc/kvm (misc) UDEV [3631.130816] change /devices/virtual/misc/kvm (misc) and stopping udevd leaves all three times untouched. So it is udev after all. I just don't know how to find the rule that is causing the issue. Anyway, as for this patch, I think we can merge it in the end, can't we? Not really. Udev is in use everywhere, so this behaviour makes the patch useless in practice, even though it is technically right in theory :-( Does it? With this behaviour we still do the "expensive" work after any machine has started. But for one machine starting it still has the effect of running it only once. And we *need* to run it for each machine unless we also cache the result per (at least) user:group of that machine as every machine can run under different user
Re: [libvirt] [PATCH v2] vhost-user: define conventions for vhost-user backends
Hi On Wed, Oct 31, 2018 at 8:05 PM Daniel P. Berrangé wrote: > > On Wed, Oct 31, 2018 at 07:44:36PM +0400, Marc-André Lureau wrote: > > Hi > > > > On Tue, Oct 16, 2018 at 4:14 PM Daniel P. Berrangé > > wrote: > > > > > It just reinvents the chardev unix socket syntax, but in a > > > > > different adhoc manner, which is repeating the mistake we have > > > > > made time & again in QEMU. Using QAPI we can directly accept > > > > > the ChardevSocket syntax we already know about. Instead of > > > > > having --socket-path and --fd and documenting that they are > > > > > mutually exclusive ChardevSocket QAPI schema provides that > > > > > representation in a well defined format which is discoverable > > > > > and QEMU and mgmt apps already understand. > > > > > > > > That would require external vhost-user backends to implement QAPI/json > > > > parsing. Is this necessary for a vhost-user backend? I doubt it. > > > > > > They could, but would not be required, to implement QAPI/json parser. > > > > > > The QAPI schema defines a standard for how to model & interpret the > > > non-scalar values for command line arguments. An external impl would > > > need to ensure that whatever parsing it does for CLI args is semantically > > > compatible with the parsing rules defined by the QEMU QAPI schema we > > > define to ensure interoperability of its impl. > > > > So you would want to have something like? > > > > --chardev '{ "id" : "bar", "backend" : { "type" : "socket", "data" : { > > "addr" : { "type": "unix", "path": "/tmp/foo.sock" }, "server": > > "false" } } }' > > I wasn't specificially suggesting json syntax. Just the flattened > dot separate syntax, whose valid components are mapped to corresponding > qapi schema defintions. eg closer to what we have already today > > --chardev socket,id=bar,path=/tmp/foo.sock,server > ok > > > > instead of: > > > > --socket-path=/tmp/foo.sock > > > > I don't really get what that will help with, for the vhost-user backend > > case... > > It avoids presuming that we forever want to launch the backend with > a single socket path and nothing else. Using the chardev, means we > can choosen between client/server mode. We can choose whether to > pass an FD to the socket, instead of the socket path. Whether the > reconnect flag is set or not, and all the other aspects of a chardev > you can define. > We are trying to define a common specification for vhost-user backends. Where do we stop defining what --chardev should support? > I don't think we should have to add more & more adhoc CLI args (--socket-path, > --fd, --reconnect, etc) and then manually parse them & pack their values > together into a chardev object. The backends most likely won't use qemu chardev (nor qapi), and be limited to unix socket. > > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On Tue, Oct 30, 2018 at 03:07:59PM +, Daniel P. Berrangé wrote: On Tue, Oct 30, 2018 at 03:45:36PM +0100, Michal Privoznik wrote: On 10/30/2018 02:46 PM, Michal Privoznik wrote: > On 10/30/2018 01:55 PM, Daniel P. Berrangé wrote: >> On Tue, Oct 30, 2018 at 10:32:08AM +, Daniel P. Berrangé wrote: >>> On Tue, Oct 30, 2018 at 11:08:45AM +0100, Michal Privoznik wrote: On 10/30/2018 10:35 AM, Daniel P. Berrangé wrote: > On Tue, Oct 30, 2018 at 09:13:50AM +0100, Michal Privoznik wrote: >> On 10/29/2018 06:34 PM, Marc Hartmayer wrote: >>> Introduce caching whether /dev/kvm is usable as the QEMU user:QEMU >>> group. This reduces the overhead of the QEMU capabilities cache >>> lookup. Before this patch there were many fork() calls used for >>> checking whether /dev/kvm is accessible. Now we store the result >>> whether /dev/kvm is accessible or not and we only need to re-run the >>> virFileAccessibleAs check if the ctime of /dev/kvm has changed. >>> >>> Suggested-by: Daniel P. Berrangé >>> Signed-off-by: Marc Hartmayer >>> --- >>> src/qemu/qemu_capabilities.c | 54 ++-- >>> 1 file changed, 52 insertions(+), 2 deletions(-) >>> >>> diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c >>> index e228f52ec0bb..85516954149b 100644 >>> --- a/src/qemu/qemu_capabilities.c >>> +++ b/src/qemu/qemu_capabilities.c >>> @@ -3238,6 +3238,10 @@ struct _virQEMUCapsCachePriv { >>> virArch hostArch; >>> unsigned int microcodeVersion; >>> char *kernelVersion; >>> + >>> +/* cache whether /dev/kvm is usable as runUid:runGuid */ >>> +virTristateBool kvmUsable; >>> +time_t kvmCtime; >>> }; >>> typedef struct _virQEMUCapsCachePriv virQEMUCapsCachePriv; >>> typedef virQEMUCapsCachePriv *virQEMUCapsCachePrivPtr; >>> @@ -3824,6 +3828,52 @@ virQEMUCapsSaveFile(void *data, >>> } >>> >>> >>> +/* Determine whether '/dev/kvm' is usable as QEMU user:QEMU group. */ >>> +static bool >>> +virQEMUCapsKVMUsable(virQEMUCapsCachePrivPtr priv) >>> +{ >>> +struct stat sb; >>> +static const char *kvm_device = "/dev/kvm"; >>> +virTristateBool value; >>> +virTristateBool cached_value = priv->kvmUsable; >>> +time_t kvm_ctime; >>> +time_t cached_kvm_ctime = priv->kvmCtime; >>> + >>> +if (stat(kvm_device, &sb) < 0) { >>> +virReportSystemError(errno, >>> + _("Failed to stat %s"), kvm_device); >>> +return false; >>> +} >>> +kvm_ctime = sb.st_ctime; >> >> This doesn't feel right. /dev/kvm ctime is changed every time qemu is >> started or powered off (try running stat over it before and after a >> domain is started/shut off). So effectively we will fork more often than >> we would think. Should we cache inode number instead? Because for all >> that we care is simply if the file is there. > > Urgh, that is a bit strange and not the usual semantics for timestamps :-( Indeed. > > We can't stat the inode - the code was explicitly trying to cope with the > way /dev/kvm can change permissions when udev rules get applied. We would > have to compare the user, group, permissions mask and even ACL, or a hash > of those. Well, we can use ctime as suggested and post a patch for kernel to fix ctime behaviour. Until the patch is merged our behaviour would be suboptimal, but still better than it is now. >>> >>> I guess lets talk to KVM team for their input on this and then decide >>> what todo. >> >> Hmm, I wonder if it is not actually a kernel problem, but rather something >> in userspace genuinely touching the device in a way that caues these >> timestamps to be updated. >> > > It is kernel problem. In my testing, the moment I call: > > ioctl(kvm, KVM_CREATE_VM, 0); Okay, I have to retract this claim. 'udevadm monitor' shows some events: KERNEL[3631.129645] change /devices/virtual/misc/kvm (misc) UDEV [3631.130816] change /devices/virtual/misc/kvm (misc) and stopping udevd leaves all three times untouched. So it is udev after all. I just don't know how to find the rule that is causing the issue. Anyway, as for this patch, I think we can merge it in the end, can't we? Not really. Udev is in use everywhere, so this behaviour makes the patch useless in practice, even though it is technically right in theory :-( Does it? With this behaviour we still do the "expensive" work after any machine has started. But for one machine starting it still has the effect of running it only once. And we *need* to run it for each machine unless we also cache the result per (at least) user:group of that machine as every machine can run under different user:group. I'll go through the patch again (just skimmed it the first
Re: [libvirt] [PATCH v2] qemu: Introduce caching whether /dev/kvm is accessible
On 11/01/2018 05:11 AM, Ján Tomko wrote: > On Tue, Oct 30, 2018 at 03:53:59PM +, Daniel P. Berrangé wrote: >> On Tue, Oct 30, 2018 at 04:47:32PM +0100, Michal Privoznik wrote: >>> Well, caching owner + seclabels + ACLs won't help either. What if user >>> loads some profile into AppArmor or something that denies previously >>> allowed access to /dev/kvm (or vice versa)? What I am saying is that >>> there are some security models which base their decisions on something >>> else than file attributes. >> >> The virFileAccessibleAs check for /dev/kvm was put in their to ensure >> that we correctly report usability of KVM in the capabilities XML >> according to file permissions/ownership. Essentially KVM is not usable >> until the udev rule is applied to change permissions to world accessible >> or to set the kvm group. >> > > Can libvirt be run sooner than the permissions are set up during regular > start-up, or this is just for the case of the user randomly touching the > permissions? Sure it can. And it is. For instance, I have ConsoleKit installed on my system and when I log in as a regular user into my system, it adds an ACL entry to /dev/kvm to allow access to the user. Libvirtd is already running at that point (well, except not really because I don't run it as a service since I'm running it from git all the time. But you get the picture). > > IIRC the problem was that users with vmx disabled in BIOS setup needed > to delete libvirt's qemuCaps cache manually after enabling it even after > restarting the system. Sounds like it. The commit that introduced the check is d87df9bd39b. Honestly, I don't think we need to care about perms - we can assume those are set properly (as they don't change often). What we have to care about is inserting/removing kvm module (even though it's not necessary these days since virtualbox has learned how to co-exist with KVM). Michal -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] nwfilter: intantiate filters on loading driver
On 01.11.2018 03:48, John Ferlan wrote: > > > On 10/31/18 10:41 AM, Nikolay Shirokovskiy wrote: >> On 31.10.2018 16:14, John Ferlan wrote: >>> >>> >>> On 10/30/18 3:24 AM, Nikolay Shirokovskiy wrote: On 29.10.2018 22:37, John Ferlan wrote: > > > On 10/15/18 4:26 AM, Nikolay Shirokovskiy wrote: >> Before using filters binding filters instantiation was done by >> hypervisors >> drivers initialization code (qemu was the only such hypervisor). Now qemu >> reconnection done supposes it should be done by nwfilter driver probably. >> Let's add this missing step. >> >> Signed-off-by: Nikolay Shirokovskiy >> --- >> src/nwfilter/nwfilter_driver.c | 3 +++ >> 1 file changed, 3 insertions(+) >> > > If there's research you've done where the instantiation was done before > introduction of the nwfilter bindings that would be really helpful... > > I found that virNWFilterBuildAll was introduced in commit 3df907bfff. > There were 2 callers for it: > >1. virNWFilterTriggerRebuildImpl >2. nwfilterStateReload > > The former called as part of the virNWFilterConfLayerInit callback > during nwfilterStateInitialize (about 50 lines earlier). >>> >>> First off let me say you certainly find a lot of interesting bugs/issues >>> that are complex and that's good. Often times I wonder how you trip >>> across things or how to quantify what you've found. Some are simpler >>> than others. This one on the surface would seem to be simple, but I keep >>> wondering is there a downside. >> >> The issue is related to my recent posts: >> >> [1] [RFC] Faster libvirtd restart with nwfilter rules >> https://www.redhat.com/archives/libvir-list/2018-September/msg01206.html >> which continues here: >> https://www.redhat.com/archives/libvir-list/2018-October/msg00657.html >> >> In short if there is lots of VMs with filters then libvirtd restart takes a >> lot of time >> during which libvirtd is unresponsive. By the way the issue is found in >> libvirt >> versions 3.9.0 and ealier (don't know of newer versions, Virtuozzo is based >> on 3.9.0 now, >> just like Centos7 :) ) > > So many different issues - trying to focus on just the one for this > particular patch. > >> >> And attempts to fix it: >> >> [2] [PATCH RFC 0/4] nwfilter: don't reinstantiate filters if they are not >> changed >> https://www.redhat.com/archives/libvir-list/2018-October/msg00904.html >> >> [3] [PATCH v2 0/2] nwfilter: don't reinstantiate rules if there is no need >> to >> https://www.redhat.com/archives/libvir-list/2018-October/msg01317.html >> >> And the reason I started v2 is Laine mentioned that it is important to >> reinstantiate rules on restart (v1 do not care if somebody messed tables). >> > > I've seen the patches and even read some briefly, but still need to take > this particular patch as separate from those. > >> And I discovered that upstream version stops reinstantiating rules at all >> after introducing filters bindings. And this functionality is old: > > So the key is "when" did that happen? That is can you pinpoint or > bisect a time in history where the filters were instantiated? And by > instantiated what exactly (call) are you referencing? The below commit is the one. It adds instantiating filters on libvirtd restart. By instantiating I mean that firewall rules correspondent to filters are actually get [re]added to iptables etc. > >> >> commit cf6f8b9a9720fe5323a84e690de9fbf8ba41f6ac >> Author: Stefan Berger >> Date: Mon Aug 16 12:59:54 2010 -0400 >> >> nwfilter: extend nwfilter reload support >> > > Yes, nwfilter is old, brittle, and in need of attention. Not sure anyone > really wants to take on the task though! I just realized I never got the > common object code implemented there. Mostly because of lock issues. > Suffice to say there's more "interesting" changes in the nwfilter space > being discussed internally. > >>> >>> The nwfilter processing is kindly said rather strange, complex, and to a >>> degree fragile. Thus when patches come along they are met with greater >>> scrutiny. From just reading the commit message here I don't get a sense >>> for the problem and why the solution fixes it. So I'm left to try and >>> research myself and ask a lot of questions. >>> >>> BTW, some of the detail provided in this response is really useful as >>> either part of a cover or after the --- in the single patch. I would >>> think you'd "know" when you've done lots of research into a problem that >>> providing reviews more than a mere hint would be useful! For nwfilter >>> bindings, it's hard to imagine this is one of those - it just happened >>> type events. There seems to be a very specific sequence that's missing >>> from the commit description. >>> Right but virNWFilterTriggerRebuildImpl is not actually called, it is set as rebuild callback. This rebuild callback can be called in virNWFi