[libvirt] [RFC PATCH v2 1/1] qemu: host NUMA hugepage policy without guest NUMA
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.

Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory .../> but no <memnode .../> elements.

This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).

Note: This is not visible to the guest and does not appear to create a migration incompatibility.

Signed-off-by: Sam Bobroff
---
Hello libvirt community,

This patch is an RFC of an attempt to understand and fix a problem with NUMA memory placement policy being ignored in a specific situation. Previous discussion on this issue:

https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
https://www.redhat.com/archives/libvir-list/2016-October/msg00514.html

(Sorry, this is a long description because the situation is complicated, there are several ways of approaching it, and I need to explain my reasoning for choosing this specific method.)

The issue I'm trying to fix:

The problem is apparent when using this combination of features:

* The guest is backed by hugepages in the host.
* The backing memory is constrained by NUMA memory placement policy.
* The guest either does not use NUMA nodes at all, or does but does not specify a per-node policy.

The guest XML would therefore contain blocks similar to the following:

  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <numatune>
    <memory mode='strict' nodeset='...'/>
  </numatune>

(And does not use <memnode .../>.)

(Policy works correctly without hugepages, or when memnode is used.)

What happens is that the guest runs successfully but the NUMA memory placement policy is ignored, allowing memory to be allocated on any (host) node.

Attempted solutions:

The most obvious approach to me is a trivial patch to qemuBuildMemoryBackendStr(). The test that checks to see if a memory-backend is necessary (around line 3332 in qemu_command.c) is this:

    /* If none of the following is requested... */
    if (!needHugepage && !userNodeset && !memAccess && !nodeSpecified && !force) {
        /* report back that using the new backend is not necessary
         * to achieve the desired configuration */
        ret = 1;
    } else {

This test does not consider the nodemask, so when no other reason exists to use a memory-backend (which is the case in point) the policy is lost. Adding nodemask to this test appears to easily fix the issue. However:

* It only works when the guest uses NUMA nodes (because otherwise the whole function is skipped at a higher level).
* When the guest does use NUMA nodes, it causes a memory-backend object to be added to each one (when they did not previously have one) and this causes migration incompatibility.

So it seems worthwhile to look for other options.

Other ways of controlling NUMA memory placement are numactl (libnuma) or cgroups, but in practice neither can work in this situation due to the interaction of several features:

* numactl and cgroups can only apply policy to all memory allocated by that thread.
* When a guest is backed by hugepages, memory needs to be preallocated.
* QEMU performs memory pre-allocation from its main thread, not the VCPU threads.

This seems to mean that policy applied to the QEMU VCPU threads does not apply to pre-allocated memory, and policy applied to the QEMU main thread incorrectly applies to all of QEMU's allocations, not just the VCPU ones. My testing seems to confirm this. If that is true, then neither of those approaches can work (at least, not without changes to QEMU).
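For completeness, the "trivial patch" to qemuBuildMemoryBackendStr() mentioned above would amount to roughly the following (a sketch only: it assumes the nodemask that the function already computes for the policy is available under that name at this point, and as noted it is not a good fix on its own because of the migration impact):

    /* If none of the following is requested... */
    if (!needHugepage && !userNodeset && !memAccess && !nodeSpecified &&
        !nodemask && !force) {
        /* report back that using the new backend is not necessary
         * to achieve the desired configuration */
        ret = 1;
    } else {
        /* ... build the memory-backend object as before ... */
    }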
At this point it appeared that the only ways to proceed would all require a migration-breaking change, so I started to look at QEMU to see if that could be worked around somehow.

... another option:

While investigating QEMU I discovered that it is possible to use a QEMU memory backend object without attaching it to any guest NUMA node, and when it is used that way it does not break migration. This is (to me at least) a new way of using a memory-backend object, but it appears to directly replace -mem-path and -mem-prealloc and additionally allow NUMA memory placement policy to be specified. (The ID field is still required even though it is not used.)

For example, this...

    -m 4G -mem-path /dev/hugepages -mem-prealloc

... seems to be equivalent to this:

    -m 4G -object memory-backend-file,id=xxx,prealloc=yes,path=/dev/hugepages,size=4G

But this is now possible as well:

    -m 4G -object memory-backend-file,id=xxx,prealloc=yes,path=/dev/hugepages,host-nodes=0-1,size=4G

... so it seems to be a way of solving this issue without breaking migration. Additionally, there seems to be code specifically in QEMU to prevent the migration data from changing in this case, which leads me to believe that this is an intended use of the option, but I haven't directly consulted with the QEMU community.

Implementation:
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Tue, Oct 25, 2016 at 02:13:07PM +0200, Martin Kletzander wrote: > On Tue, Oct 25, 2016 at 01:10:23PM +1100, Sam Bobroff wrote: > >On Tue, Oct 18, 2016 at 10:43:31PM +0200, Martin Kletzander wrote: > >>On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote: > >>>On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote: > >>>>On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote: > >>>>>I did look at the libnuma and cgroups approaches, but I was concerned > >>>>>they > >>>>>wouldn't work in this case, because of the way QEMU allocates memory when > >>>>>mem-prealloc is used: the memory is allocated in the main process, > >>>>>before the > >>>>>CPU threads are created. (This is based only on a bit of hacking and > >>>>>debugging > >>>>>in QEMU, but it does seem explain the behaviour I've seen so far.) > >>>>> > >>>> > >>>>But we use numactl before QEMU is exec()'d. > >>> > >>>Sorry, I jumped ahead a bit. I'll try to explain what I mean: > >>> > >>>I think the problem with using this method would be that the NUMA policy is > >>>applied to all allocations by QEMU, not just ones related to the memory > >>>backing. I'm not sure if that would cause a serious problem but it seems > >>>untidy, > >>>and it doesn't happen in other situations (i.e. with separate memory > >>>backend > >>>objects, QEMU sets up the policy specifically for each one and other > >>>allocations aren't affected, AFAIK). Presumably, if memory were very > >>>restricted it could prevent the guest from starting. > >>> > >> > >>Yes, it is, that's what does if you don't have any > >>other () specifics set. > >> > >>>>>I think QEMU could be altered to move the preallocations into the VCPU > >>>>>threads but it didn't seem trivial and I suspected the QEMU community > >>>>>would > >>>>>point out that there was already a way to do it using backend objects. > >>>>>Another > >>>>>option would be to add a -host-nodes parameter to QEMU so that the > >>>>>policy can > >>>>>be given without adding a memory backend object. (That seems like a more > >>>>>reasonable change to QEMU.) > >>>>> > >>>> > >>>>I think upstream won't like that, mostly because there is already a > >>>>way. And that is using memory-backend object. I think we could just > >>>>use that and disable changing it live. But upstream will probably want > >>>>that to be configurable or something. > >>> > >>>Right, but isn't this already an issue in the cases where libvirt is > >>>already > >>>using memory backend objects and NUMA policy? (Or does libvirt already > >>>disable > >>>changing it live in those situations?) > >>> > >> > >>It is. I'm not trying to say libvirt is perfect. There are bugs, > >>e.g. like this one. The problem is that we tried to do *everything*, > >>but it's not currently possible. I'm trying to explain how stuff works > >>now. It definitely needs some fixing, though. > > > >OK :-) > > > >Well, given our discussion, do you think it's worth a v2 of my original patch > >or would it be better to drop it in favour of some broader change? > > > > Honestly, I thought about the approaches so much I'm now not sure I'll > make a good decision. RFC could do. If I were to pick, I would go with > a new setting that would control whether we want the binding to be > changeable throughout the domain's lifetime or not so that we can make > better decisions (and don't feel bad about the bad ones). I feel the same way. OK, I'll try an RFC patch with a lot of description. 
I'm specifically trying to address the issue I originally raised, which isn't quite the same thing as the changeability of the bindings, but I'll keep that in mind. I think your point about changing the bindings will apply in the same way whenever QEMU's memory-backend objects are used with their "host-nodes" attribute (since they are what causes QEMU to apply policy), so I don't think I'm suggesting any significant change there.

If you want to add the new setting you mention above, I'd be happy to base my patch on top of that work. ;-)

Cheers,
Sam.

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Tue, Oct 18, 2016 at 10:43:31PM +0200, Martin Kletzander wrote: > On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote: > >On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote: > >>On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote: > >>>I did look at the libnuma and cgroups approaches, but I was concerned they > >>>wouldn't work in this case, because of the way QEMU allocates memory when > >>>mem-prealloc is used: the memory is allocated in the main process, before > >>>the > >>>CPU threads are created. (This is based only on a bit of hacking and > >>>debugging > >>>in QEMU, but it does seem explain the behaviour I've seen so far.) > >>> > >> > >>But we use numactl before QEMU is exec()'d. > > > >Sorry, I jumped ahead a bit. I'll try to explain what I mean: > > > >I think the problem with using this method would be that the NUMA policy is > >applied to all allocations by QEMU, not just ones related to the memory > >backing. I'm not sure if that would cause a serious problem but it seems > >untidy, > >and it doesn't happen in other situations (i.e. with separate memory backend > >objects, QEMU sets up the policy specifically for each one and other > >allocations aren't affected, AFAIK). Presumably, if memory were very > >restricted it could prevent the guest from starting. > > > > Yes, it is, that's what does if you don't have any > other () specifics set. > > >>>I think QEMU could be altered to move the preallocations into the VCPU > >>>threads but it didn't seem trivial and I suspected the QEMU community would > >>>point out that there was already a way to do it using backend objects. > >>>Another > >>>option would be to add a -host-nodes parameter to QEMU so that the policy > >>>can > >>>be given without adding a memory backend object. (That seems like a more > >>>reasonable change to QEMU.) > >>> > >> > >>I think upstream won't like that, mostly because there is already a > >>way. And that is using memory-backend object. I think we could just > >>use that and disable changing it live. But upstream will probably want > >>that to be configurable or something. > > > >Right, but isn't this already an issue in the cases where libvirt is already > >using memory backend objects and NUMA policy? (Or does libvirt already > >disable > >changing it live in those situations?) > > > > It is. I'm not trying to say libvirt is perfect. There are bugs, > e.g. like this one. The problem is that we tried to do *everything*, > but it's not currently possible. I'm trying to explain how stuff works > now. It definitely needs some fixing, though. OK :-) Well, given our discussion, do you think it's worth a v2 of my original patch or would it be better to drop it in favour of some broader change? Cheers, Sam. -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
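As an aside, the point above about policy applying to all of QEMU's allocations can be illustrated with a small sketch using the standard libnuma API (this is not libvirt's actual code and the function name is made up; it only shows the general shape of binding a process before exec()):

    #include <numa.h>

    /* Sketch: restrict all future allocations of the current process to
     * host node 0, then exec() the QEMU binary.  The task policy is
     * preserved across execve(), so QEMU's heap, stacks and the
     * preallocated guest RAM are all constrained by the same mask --
     * it cannot be limited to the memory backing alone from outside. */
    static void bind_process_then_exec(void)
    {
        struct bitmask *mask = numa_allocate_nodemask();

        numa_bitmask_setbit(mask, 0);   /* host node 0 only */
        numa_set_membind(mask);         /* applies process-wide from now on */
        numa_bitmask_free(mask);

        /* execv("/path/to/qemu-system-...", ...); */
    }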
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote: > On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote: > >On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote: > >>On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote: > >>>On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote: > >>>>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote: > >>>>>At the moment, guests that are backed by hugepages in the host are > >>>>>only able to use policy to control the placement of those hugepages > >>>>>on a per-(guest-)CPU basis. Policy applied globally is ignored. > >>>>> > >>>>>Such guests would use and > >>>>>a block with but no >>>>>.../> elements. > >>>>> > >>>>>This patch corrects this by, in this specific case, changing the QEMU > >>>>>command line from "-mem-prealloc -mem-path=..." (which cannot > >>>>>specify NUMA policy) to "-object memory-backend-file ..." (which can). > >>>>> > >>>>>Note: This is not visible to the guest and does not appear to create > >>>>>a migration incompatibility. > >>>>> > >>>> > >>>>It could make sense, I haven't tried yet, though. However, I still > >>>>don't see the point in using memory-backend-file. Is it that when you > >>>>don't have cpuset cgroup the allocation doesn't work well? Because it > >>>>certainly does work for me. > >>> > >>>Thanks for taking a look at this :-) > >>> > >>>The point of using a memory-backend-file is that with it, the NUMA policy > >>>can > >>>be specified to QEMU, but with -mem-path it can't. It seems to be a way to > >>>tell > >>>QEMU to apply NUMA policy in the right place. It does seem odd to me to use > >>>memory-backend-file without attaching the backend to a guest NUMA node, > >>>but it > >>>seems to do the right thing in this case. (If there are guest NUMA nodes, > >>>or if > >>>hugepages aren't being used, policy is correctly applied.) > >>> > >>>I'll describe my test case in detail, perhaps there's something I don't > >>>understand > >>>happening. > >>> > >>>* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of > >>>hugepages > >>> on node 1, and none on node 0. > >>> > >>>* I create a 2G guest using virt-install: > >>> > >>>virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom > >>>~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu > >>>--memorybacking hugepages=on --graphics vnc --arch ppc64le > >>> > >>>* I "virsh destroy" and then "virsh edit" to add this block to the guest > >>>XML: > >>> > >>> > >>> > >>> > >>> > >>>* "virsh start", and the machine starts (I believe it should fail due to > >>>insufficient memory satasfying the policy). > >>>* "numastat -p $(pidof qemu-system-ppc64)" shows something like this: > >>> > >>>Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) > >>> Node 0 Node 1 Total > >>> --- --- --- > >>>Huge 0.00 2048.00 2048.00 > >>>Heap 8.120.008.12 > >>>Stack0.030.000.03 > >>>Private 35.806.10 41.90 > >>> --- --- --- > >>>Total 43.95 2054.10 2098.05 > >>> > >>>So it looks like it's allocated hugepages from node 1, isn't this > >>>violating the > >>>policy I set via numatune? > >>> > >> > >>Oh, now I get it. We are doing our best to apply that policy to qemu > >>even when we don't have this option. However, using this works even > >>better (which is probably* what we want). And that's the reasoning > >>behind this. > >> > &g
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote: > On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote: > >On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote: > >>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote: > >>>At the moment, guests that are backed by hugepages in the host are > >>>only able to use policy to control the placement of those hugepages > >>>on a per-(guest-)CPU basis. Policy applied globally is ignored. > >>> > >>>Such guests would use and > >>>a block with but no >>>.../> elements. > >>> > >>>This patch corrects this by, in this specific case, changing the QEMU > >>>command line from "-mem-prealloc -mem-path=..." (which cannot > >>>specify NUMA policy) to "-object memory-backend-file ..." (which can). > >>> > >>>Note: This is not visible to the guest and does not appear to create > >>>a migration incompatibility. > >>> > >> > >>It could make sense, I haven't tried yet, though. However, I still > >>don't see the point in using memory-backend-file. Is it that when you > >>don't have cpuset cgroup the allocation doesn't work well? Because it > >>certainly does work for me. > > > >Thanks for taking a look at this :-) > > > >The point of using a memory-backend-file is that with it, the NUMA policy can > >be specified to QEMU, but with -mem-path it can't. It seems to be a way to > >tell > >QEMU to apply NUMA policy in the right place. It does seem odd to me to use > >memory-backend-file without attaching the backend to a guest NUMA node, but > >it > >seems to do the right thing in this case. (If there are guest NUMA nodes, or > >if > >hugepages aren't being used, policy is correctly applied.) > > > >I'll describe my test case in detail, perhaps there's something I don't > >understand > >happening. > > > >* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of > >hugepages > > on node 1, and none on node 0. > > > >* I create a 2G guest using virt-install: > > > >virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom > >~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu > >--memorybacking hugepages=on --graphics vnc --arch ppc64le > > > >* I "virsh destroy" and then "virsh edit" to add this block to the guest XML: > > > > > > > > > > > >* "virsh start", and the machine starts (I believe it should fail due to > >insufficient memory satasfying the policy). > >* "numastat -p $(pidof qemu-system-ppc64)" shows something like this: > > > >Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) > > Node 0 Node 1 Total > > --- --- --- > >Huge 0.00 2048.00 2048.00 > >Heap 8.120.008.12 > >Stack0.030.000.03 > >Private 35.806.10 41.90 > > --- --- --- > >Total 43.95 2054.10 2098.05 > > > >So it looks like it's allocated hugepages from node 1, isn't this violating > >the > >policy I set via numatune? > > > > Oh, now I get it. We are doing our best to apply that policy to qemu > even when we don't have this option. However, using this works even > better (which is probably* what we want). And that's the reasoning > behind this. > > * I'm saying probably because when I was adding numactl binding to be > used together with cgroups, I was told that we couldn't change the > binding afterwards and it's bad. I feel like we could do something > with that and it would help us in the future, but there needs to be a > discussion, I guess. 
Because I might be one of the few =) > > So to recapitulate that, there are three options how to affect the > allocation of qemu's memory: > > 1) numactl (libnuma): it works as expected, but cannot be changed later > > 2) cgroups: so strict it has to be applied after qemu started, due to >that it doesn't work right, especially for stuff that gets all >pre-allocated (like hugepages). it can be changed later, but it >won
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Wed, Oct 12, 2016 at 09:48:07AM +0200, Peter Krempa wrote: > On Wed, Oct 12, 2016 at 15:04:53 +1100, Sam Bobroff wrote: > > At the moment, guests that are backed by hugepages in the host are > > only able to use policy to control the placement of those hugepages > > on a per-(guest-)CPU basis. Policy applied globally is ignored. > > > > Such guests would use and > > a block with but no > .../> elements. > > > > This patch corrects this by, in this specific case, changing the QEMU > > command line from "-mem-prealloc -mem-path=..." (which cannot > > specify NUMA policy) to "-object memory-backend-file ..." (which can). > > > > Note: This is not visible to the guest and does not appear to create > > a migration incompatibility. > > > > Signed-off-by: Sam Bobroff > > --- > > There was some discussion leading up to this patch, here: > > > > https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html > > > > During that discussion it seemed that fixing this issue would cause > > migration incompatibility but after some careful testing, it appears > > that it only does when a memory-backend object is attached to a guest > > NUMA node and that is not the case here. If only one is created, and > > used globally (not attached via mem=), the migration data does not > > seem to be changed and so it seems reasonable to try something like > > this patch. > > > > This patch does work for my test cases but I don't claim a deep > > understanding of the libvirt code so this is at least partly a RFC. > > Comments welcome :-) > > > > Cheers, > > Sam. > > > > src/qemu/qemu_command.c | 55 > > - > > src/qemu/qemu_command.h | 2 +- > > 2 files changed, 46 insertions(+), 11 deletions(-) > > > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c > > index 0804133..c28c8f2 100644 > > --- a/src/qemu/qemu_command.c > > +++ b/src/qemu/qemu_command.c > > @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, > >int guestNode, > >virBitmapPtr userNodeset, > >virBitmapPtr autoNodeset, > > - virDomainDefPtr def, > > + const virDomainDefPtr def, > > This does not work as expected. Using 'const' with types that hide the > pointer makes the pointer const and not the structure that the pointer > points to. Ah, of course you're right. > >virQEMUCapsPtr qemuCaps, > >virQEMUDriverConfigPtr cfg, > >const char **backendType, > > @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd, > > > > static int > > qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, > > -const virDomainDef *def, > > +const virDomainDefPtr def, > > Same as above. This change is wrong. OK. > > virQEMUCapsPtr qemuCaps, > > -virCommandPtr cmd) > > +virCommandPtr cmd, > > +virBitmapPtr auto_nodeset) > > { > > const long system_page_size = virGetSystemPageSizeKB(); > > char *mem_path = NULL; > > +virBitmapPtr nodemask = NULL; > > +const char *backendType = NULL; > > +char *backendStr = NULL; > > +virJSONValuePtr props; > > +int rv = -1; > > > > /* > > * No-op if hugepages were not requested. 
> > @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, > > if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < > > 0) > > return -1; > > > > -virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, > > NULL); > > +if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, > > + &nodemask, -1) < 0) > > +return -1; > > +if (nodemask && virQEMUCapsGet(qemuCaps, > > QEMU_CAPS_OBJECT_MEMORY_FILE)) { > > +props = virJSONValueNewObject(); > > This is leaked, qemuBuildMemoryBackendStr allocates one itself. Ah of course, sorry. I think I can just remove that line. > > +if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), > > + 0, -1, NULL, auto_nodeset,
Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote: > On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote: > >At the moment, guests that are backed by hugepages in the host are > >only able to use policy to control the placement of those hugepages > >on a per-(guest-)CPU basis. Policy applied globally is ignored. > > > >Such guests would use and > >a block with but no >.../> elements. > > > >This patch corrects this by, in this specific case, changing the QEMU > >command line from "-mem-prealloc -mem-path=..." (which cannot > >specify NUMA policy) to "-object memory-backend-file ..." (which can). > > > >Note: This is not visible to the guest and does not appear to create > >a migration incompatibility. > > > > It could make sense, I haven't tried yet, though. However, I still > don't see the point in using memory-backend-file. Is it that when you > don't have cpuset cgroup the allocation doesn't work well? Because it > certainly does work for me. Thanks for taking a look at this :-) The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.) I'll describe my test case in detail, perhaps there's something I don't understand happening. * I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0. * I create a 2G guest using virt-install: virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le * I "virsh destroy" and then "virsh edit" to add this block to the guest XML: * "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy). * "numastat -p $(pidof qemu-system-ppc64)" shows something like this: Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) Node 0 Node 1 Total --- --- --- Huge 0.00 2048.00 2048.00 Heap 8.120.008.12 Stack0.030.000.03 Private 35.806.10 41.90 --- --- --- Total 43.95 2054.10 2098.05 So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune? > >Signed-off-by: Sam Bobroff > >--- > >There was some discussion leading up to this patch, here: > > > >https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html > > > >During that discussion it seemed that fixing this issue would cause > >migration incompatibility but after some careful testing, it appears > >that it only does when a memory-backend object is attached to a guest > >NUMA node and that is not the case here. If only one is created, and > >used globally (not attached via mem=), the migration data does not > >seem to be changed and so it seems reasonable to try something like > >this patch. > > > >This patch does work for my test cases but I don't claim a deep > >understanding of the libvirt code so this is at least partly a RFC. > >Comments welcome :-) > > > >Cheers, > >Sam. 
> > > >src/qemu/qemu_command.c | 55 > >- > >src/qemu/qemu_command.h | 2 +- > >2 files changed, 46 insertions(+), 11 deletions(-) > > > >diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c > >index 0804133..c28c8f2 100644 > >--- a/src/qemu/qemu_command.c > >+++ b/src/qemu/qemu_command.c > >@@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, > > int guestNode, > > virBitmapPtr userNodeset, > > virBitmapPtr autoNodeset, > >- virDomainDefPtr def, > >+ const virDomainDefPtr def, > > virQEMUCapsPtr qemuCaps, > > virQEMUDriverConfigPtr cfg, >
[libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored. Such guests would use and a block with but no elements. This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can). Note: This is not visible to the guest and does not appear to create a migration incompatibility. Signed-off-by: Sam Bobroff --- There was some discussion leading up to this patch, here: https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=), the migration data does not seem to be changed and so it seems reasonable to try something like this patch. This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-) Cheers, Sam. src/qemu/qemu_command.c | 55 - src/qemu/qemu_command.h | 2 +- 2 files changed, 46 insertions(+), 11 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0804133..c28c8f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, int guestNode, virBitmapPtr userNodeset, virBitmapPtr autoNodeset, - virDomainDefPtr def, + const virDomainDefPtr def, virQEMUCapsPtr qemuCaps, virQEMUDriverConfigPtr cfg, const char **backendType, @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd, static int qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, -const virDomainDef *def, +const virDomainDefPtr def, virQEMUCapsPtr qemuCaps, -virCommandPtr cmd) +virCommandPtr cmd, +virBitmapPtr auto_nodeset) { const long system_page_size = virGetSystemPageSizeKB(); char *mem_path = NULL; +virBitmapPtr nodemask = NULL; +const char *backendType = NULL; +char *backendStr = NULL; +virJSONValuePtr props; +int rv = -1; /* * No-op if hugepages were not requested. @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0) return -1; -virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); +if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, + &nodemask, -1) < 0) +return -1; +if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) { +props = virJSONValueNewObject(); +if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), + 0, -1, NULL, auto_nodeset, + def, qemuCaps, cfg, &backendType, + &props, false) < 0) +goto cleanup; +if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType, + "mem", + props))) +goto cleanup; +virCommandAddArgList(cmd, "-object", backendStr, NULL); +rv = 0; +cleanup: +VIR_FREE(backendStr); +VIR_FREE(props); +} +else { +if (nodemask || 1) +virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Memory file backend objects are " + "unsupported by QEMU binary. 
Global NUMA " + "hugepage policy will be ignored.")); +virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); +rv = 0; +} VIR_FREE(mem_path); -return 0; +return rv; } static int qemuBuildMemCommandLine(virCommandPtr cmd, virQEMUDriverConfigPtr cfg, -const virDomainDef *def, -virQEMUCapsPtr qemuCaps) +const virDomainDefPtr def, +virQEMUCapsPtr q
Re: [libvirt] Question about QEMU + hugepages + NUMA memory + migration
On Tue, Oct 04, 2016 at 11:28:15AM +0200, Martin Kletzander wrote: > On Tue, Oct 04, 2016 at 02:34:57PM +1100, Sam Bobroff wrote: > >On Mon, Oct 03, 2016 at 04:27:25PM +0200, Martin Kletzander wrote: > >>On Mon, Aug 01, 2016 at 02:01:22PM +1000, Sam Bobroff wrote: > >>>Hi libvirt people, > >>> > >>>I've been looking at a (probable) bug and I'm not sure how to progress. The > >>>situation is a bit complicated and involves both QEMU and libvirt (and I > >>>think > >>>it may have been looked at already) so I would really appreciate some > >>>advice on > >>>how to approach it. I'm using a pretty recent master version of libvirt > >>>from > >>>git and I'm testing on a ppc64le host with a similar guest but this doesn't > >>>seem to be arch-specific. > >>> > >> > >>Sorry I haven't replied earlier and I'm respawning this thread now, I > >>just noticed this thread being marked for replying only after I fixed > >>similar thing to what you describe. > > > >No problem! :-) > > > >>>If I create a QEMU guest (e.g. via virt-install) that requests both > >>>hugepage > >>>backing on the host and NUMA memory placement on the host, the NUMA > >>>placement > >>>seems to be ignored. If I do: > >>> > >>># echo 0 > /proc/sys/vm/nr_hugepages > >>># echo 512 > > >>>/sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages > >>># virt-install --name tmp --memory=4096 --graphics none --memorybacking > >>>hugepages=yes --disk none --import --wait 0 --numatune=8 > >>> > >> > >>So to be clear, we want to use 16M hugepages, but only allocated from > >>node 8 while the only allocated ones are on node 0. > > > >Yes. > > > >>>... then hugepages are allocated on node 0 and the machine starts > >>>successfully, > >>>which seems like a bug. > >>> > >> > >>This definitely is a bug. But I'm afraid it's not in libvirt. I'll do > >>some explaining first. > > > >OK. > > > >>>I believe it should fail to start due to insufficient memory, and in fact > >>>that > >>>is what happens if cgroup support isn't detected in the host: there seems > >>>to be > >>>a fall-back path in libvirt (probably using mbind()) that works as I would > >>>expect. > >>> > >> > >>Yes, we are using multiple things to enforce NUMA binding: > >> > >> 1) cgroups - This restricts all the allocations made on the account of > >> the process, even when KVM does that. We cannot use that > >> until the QEMU process is running because it needs to > >> allocate some data from the DMA region which is usually > >> only on one node. > >> > >> 2) numactl's mbind() - it doesn't apply to kernel allocations, so > >>whatever KVM allocates it's not restricted by > >>this. We use this always on the process before > >>exec()-ing qemu binary (if compiled with > >>support to numactl, of course) > >> > >> 3) memory-backend-file's host-nodes parameter - This is the best > >> option that is used when QEMU > >> supports it, but due to migration > >> compatibility we use it only when > >> requested with > >> elements. > > > >OK good, that matches what I thought I'd worked out :-) > > > >>>Note: the relevant part of the guest XML seems to be this: > >>>»··· > >>>»···»··· > >>>»··· > >>>»··· > >>>»···»··· > >>>»··· > >>> > >>>It seems fairly clear what is happening: although QEMU is capable of > >>>allocating > >>>hugepages on specific NUMA nodes (using "memory-backend-file") libvirt is > >>>not > >>>passing those options to QEMU in this situation. > >>> > >> > >>I'm guessing you're talking about the host-nodes= parameter. > > > >I am, yes. > &g
Re: [libvirt] Question about QEMU + hugepages + NUMA memory + migration
On Mon, Oct 03, 2016 at 04:27:25PM +0200, Martin Kletzander wrote: > On Mon, Aug 01, 2016 at 02:01:22PM +1000, Sam Bobroff wrote: > >Hi libvirt people, > > > >I've been looking at a (probable) bug and I'm not sure how to progress. The > >situation is a bit complicated and involves both QEMU and libvirt (and I > >think > >it may have been looked at already) so I would really appreciate some advice > >on > >how to approach it. I'm using a pretty recent master version of libvirt from > >git and I'm testing on a ppc64le host with a similar guest but this doesn't > >seem to be arch-specific. > > > > Sorry I haven't replied earlier and I'm respawning this thread now, I > just noticed this thread being marked for replying only after I fixed > similar thing to what you describe. No problem! :-) > >If I create a QEMU guest (e.g. via virt-install) that requests both hugepage > >backing on the host and NUMA memory placement on the host, the NUMA placement > >seems to be ignored. If I do: > > > ># echo 0 > /proc/sys/vm/nr_hugepages > ># echo 512 > > >/sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages > ># virt-install --name tmp --memory=4096 --graphics none --memorybacking > >hugepages=yes --disk none --import --wait 0 --numatune=8 > > > > So to be clear, we want to use 16M hugepages, but only allocated from > node 8 while the only allocated ones are on node 0. Yes. > >... then hugepages are allocated on node 0 and the machine starts > >successfully, > >which seems like a bug. > > > > This definitely is a bug. But I'm afraid it's not in libvirt. I'll do > some explaining first. OK. > >I believe it should fail to start due to insufficient memory, and in fact > >that > >is what happens if cgroup support isn't detected in the host: there seems to > >be > >a fall-back path in libvirt (probably using mbind()) that works as I would > >expect. > > > > Yes, we are using multiple things to enforce NUMA binding: > > 1) cgroups - This restricts all the allocations made on the account of > the process, even when KVM does that. We cannot use that > until the QEMU process is running because it needs to > allocate some data from the DMA region which is usually > only on one node. > > 2) numactl's mbind() - it doesn't apply to kernel allocations, so > whatever KVM allocates it's not restricted by > this. We use this always on the process before > exec()-ing qemu binary (if compiled with > support to numactl, of course) > > 3) memory-backend-file's host-nodes parameter - This is the best > option that is used when QEMU > supports it, but due to migration > compatibility we use it only when > requested with > elements. OK good, that matches what I thought I'd worked out :-) > >Note: the relevant part of the guest XML seems to be this: > >»··· > >»···»··· > >»··· > >»··· > >»···»··· > >»··· > > > >It seems fairly clear what is happening: although QEMU is capable of > >allocating > >hugepages on specific NUMA nodes (using "memory-backend-file") libvirt is not > >passing those options to QEMU in this situation. > > > > I'm guessing you're talking about the host-nodes= parameter. I am, yes. > >I investigated this line of reasoning and if I hack libvirt to pass those > >options to QEMU it does indeed fix the problem... but it renders the machine > >state migration-incompatible with unfixed versions. This seems to have been > >why > >this hasn't been fixed already :-( > > > >So what can we do? 
> > > > From virt-install POV I would suggest adding some functionality that > would probe whether it works with and use it if possible. Or > dig deep down in kvm or qemu to see why the allocations do not conform > to the mbind(). Sorry, I don't think I understand your answer. You seem to be saying that it doesn't matter that the above XML causes memory to be allocated incorrectly, as long as we don't use it (by changing virt-install to use something else). But shouldn't the XML work correctly even if it was generated by hand? (Sorry if I'm misunderstanding!) And
Re: [libvirt] [PATCH] qemu: Only use memory-backend-file with NUMA if needed
On Fri, Sep 23, 2016 at 03:20:56PM +0200, Martin Kletzander wrote: > If this reminds you of a commit message from around a year ago, it's > 41c2aa729f0af084ede95ee9a06219a2dd5fb5df and yes, we're dealing with > "the same thing" again. Or f309db1f4d51009bad0d32e12efc75530b66836b and > it's similar. > > There is a logic in place that if there is no real need for > memory-backend-file, qemuBuildMemoryBackendStr() returns 0. However > that wasn't the case with hugepage backing. The reason for that was > that we abused the 'pagesize' variable for storing that information, but > we should rather have a separate one that specifies whether we really > need the new object for hugepage backing. And that variable should be > set only if this particular NUMA cell needs special treatment WRT > hugepages. > > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1372153 > > Signed-off-by: Martin Kletzander Hi Martin, I tested your patch and it works but it seems to make the snapshot/migration information incompatible with the current master code. e.g. after applying the patch I can't load a snapshot that was created before applying the patch (obviously only for a guest configuration that is affected by the patch). The error I get is this: $ virsh snapshot-revert tmp --snapshotname 1475026515 error: operation failed: Unknown ramblock "/objects/ram-node0", cannot accept migration error while loading state for instance 0x0 of device 'ram' Error -22 while loading VM state Presumably this is caused by the same change that fixes some other migrations, but is this case a problem? Cheers, Sam. > --- > > Notes: > This fixes migration from older libvirts. By "older", I mean > pre-(circa-)1.2.7, also in some cases pre-1.2.11, in some other cases > pre-v1.2.20. It's pretty messy. It could be back-ported as far as it's > easy to do. > > src/qemu/qemu_command.c | 8 +--- > tests/qemuxml2argvdata/qemuxml2argv-hugepages-pages2.args | 10 -- > 2 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c > index 1ac0fa116310..316aced5124a 100644 > --- a/src/qemu/qemu_command.c > +++ b/src/qemu/qemu_command.c > @@ -3202,6 +3202,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, > int ret = -1; > virJSONValuePtr props = NULL; > bool nodeSpecified = virDomainNumatuneNodeSpecified(def->numa, > guestNode); > +bool needHugepage = !!pagesize; > > *backendProps = NULL; > *backendType = NULL; > @@ -3224,10 +3225,10 @@ qemuBuildMemoryBackendStr(unsigned long long size, > mode = VIR_DOMAIN_NUMATUNE_MEM_STRICT; > > if (pagesize == 0) { > +bool thisHugepage = false; > + > /* Find the huge page size we want to use */ > for (i = 0; i < def->mem.nhugepages; i++) { > -bool thisHugepage = false; > - > hugepage = &def->mem.hugepages[i]; > > if (!hugepage->nodemask) { > @@ -3249,6 +3250,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, > > if (thisHugepage) { > /* Hooray, we've found the page size */ > +needHugepage = true; > break; > } > } > @@ -3335,7 +3337,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, > } > > /* If none of the following is requested... 
*/ > -if (!pagesize && !userNodeset && !memAccess && !nodeSpecified && !force) > { > +if (!needHugepage && !userNodeset && !memAccess && !nodeSpecified && > !force) { > /* report back that using the new backend is not necessary > * to achieve the desired configuration */ > ret = 1; > diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hugepages-pages2.args > b/tests/qemuxml2argvdata/qemuxml2argv-hugepages-pages2.args > index 447bb52d00b7..a37f03d0d2ed 100644 > --- a/tests/qemuxml2argvdata/qemuxml2argv-hugepages-pages2.args > +++ b/tests/qemuxml2argvdata/qemuxml2argv-hugepages-pages2.args > @@ -10,12 +10,10 @@ QEMU_AUDIO_DRV=none \ > -M pc \ > -m 1024 \ > -smp 2,sockets=2,cores=1,threads=1 \ > --object memory-backend-file,id=ram-node0,prealloc=yes,\ > -mem-path=/dev/hugepages2M/libvirt/qemu,size=268435456 \ > --numa node,nodeid=0,cpus=0,memdev=ram-node0 \ > --object memory-backend-file,id=ram-node1,prealloc=yes,\ > -mem-path=/dev/hugepages2M/libvirt/qemu,size=805306368 \ > --numa node,nodeid=1,cpus=1,memdev=ram-node1 \ > +-mem-prealloc \ > +-mem-path /dev/hugepages2M/libvirt/qemu \ > +-numa node,nodeid=0,cpus=0,mem=256 \ > +-numa node,nodeid=1,cpus=1,mem=768 \ > -uuid ef1bdff4-27f3-4e85-a807-5fb4d58463cc \ > -nographic \ > -nodefaults \ > -- > 2.10.0 > > -- > libvir-list mailing list > libvir-list@redhat.com > https://www.redhat.com/mailman/listinfo/libvir-list -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] qemu: Bug: NUMA memory policy ignored in some situations
Hi all,

I've found that if I create a QEMU guest that:

(a) is backed by host hugepages and
(b) specifies a host NUMA memory placement policy

... then the NUMA policy (b) is ignored. It's even possible to start the guest if the policy specifies that memory must come from non-existent host nodes.

Is this a bug?

(I ask because I suspect that it is a bug, and that it can't be fixed trivially due to migration considerations. Is that true?)

A few more details:

I do (a) using this in the guest XML:

  <memoryBacking>
    <hugepages/>
  </memoryBacking>

And (b) using:

  <numatune>
    <memory mode='strict' nodeset='...'/>
  </numatune>

If the XML contains <numa> cells and <memnode .../> elements to bind guest NUMA nodes to host NUMA nodes then the policy is followed and this behaviour isn't seen. The guest can't be started if the policy specifies invalid host nodes.

(Strangely, if the host doesn't have the cpuset cgroup mount set up or otherwise lacks cgroup support, then the policy is not ignored. This appears to be due to a fallback path in libvirt so it seems like NUMA memory policy is intended to work in this situation.)

Cheers,
Sam.

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] Question about QEMU + hugepages + NUMA memory + migration
Hi libvirt people,

I've been looking at a (probable) bug and I'm not sure how to progress. The situation is a bit complicated and involves both QEMU and libvirt (and I think it may have been looked at already) so I would really appreciate some advice on how to approach it. I'm using a pretty recent master version of libvirt from git and I'm testing on a ppc64le host with a similar guest, but this doesn't seem to be arch-specific.

If I create a QEMU guest (e.g. via virt-install) that requests both hugepage backing on the host and NUMA memory placement on the host, the NUMA placement seems to be ignored. If I do:

# echo 0 > /proc/sys/vm/nr_hugepages
# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
# virt-install --name tmp --memory=4096 --graphics none --memorybacking hugepages=yes --disk none --import --wait 0 --numatune=8

... then hugepages are allocated on node 0 and the machine starts successfully, which seems like a bug. I believe it should fail to start due to insufficient memory, and in fact that is what happens if cgroup support isn't detected in the host: there seems to be a fall-back path in libvirt (probably using mbind()) that works as I would expect.

Note: the relevant part of the guest XML seems to be this:

  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <numatune>
    <memory mode='strict' nodeset='8'/>
  </numatune>

It seems fairly clear what is happening: although QEMU is capable of allocating hugepages on specific NUMA nodes (using "memory-backend-file") libvirt is not passing those options to QEMU in this situation.

I investigated this line of reasoning and if I hack libvirt to pass those options to QEMU it does indeed fix the problem... but it renders the machine state migration-incompatible with unfixed versions. This seems to have been why this hasn't been fixed already :-(

So what can we do? I assume it's not acceptable to just break migration with a bugfix, and I can only think of two ways to fix migration:

(a) Add a new flag to the XML, and for guests without the flag, maintain the old buggy behaviour (and therefore migration compatibility).

(b) Hack QEMU so that migration can succeed between un-fixed and fixed versions. (And possibly also in the reverse direction?)

I don't like (a) because it's visible in the XML, and would have to be carried forever (or at least a long time?). I don't really like (b) either because it's tricky, and even if it could be made to work reliably, it would add mess and risk to the migration code. I'm not sure how the QEMU community would feel about it either. However, I did hack up some code and it worked at least in some simple cases.

Can anyone see a better approach? Is anyone already working on this?

Thanks,
Sam.

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] fix
On 14/08/14 20:14, Ján Tomko wrote: > On 08/14/2014 11:52 AM, Michal Privoznik wrote: >> On 14.08.2014 10:41, Ján Tomko wrote: >>> Also add qemuDomainChangeGraphicsPasswords, >>> qemuProcessVerifyGuestCPU and qemuProcessInitPCIAddresses. >>> >>> Replace tabs by spaces. >>> --- >>> I'll do some testing on the patch and push it later with this squashed in >>> if it works well. >>> >>> src/qemu/qemu_domain.c | 2 +- >>> src/qemu/qemu_hotplug.c | 12 >>> src/qemu/qemu_hotplug.h | 3 ++- >>> src/qemu/qemu_process.c | 33 - >>> 4 files changed, 31 insertions(+), 19 deletions(-) >>> >> >> ACK to the original with this squashed in. >> > > Now pushed. > > Jan > > Jan, Michal, Thanks for your help :-) (I'll be more careful with formatting in future.) Cheers, Sam. -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH v2 1/1] qemu: Tidy up job handling during live migration
During a QEMU live migration several warning messages about job handling could be written to syslog on the destination host: "entering monitor without asking for a nested job is dangerous" The messages are written because the job handling during migration uses hard coded asyncJob values in several places that are incorrect. This patch passes the required asyncJob value around and prevents the warnings as well as any issues that the warnings may be referring to. Signed-off-by: Sam Bobroff --- v2: * Handle failures from qemuDomainObjEnterMonitorAsync() rather than ignoring them. * Include qemuDomainChangeGraphicsPasswords(). src/qemu/qemu_domain.c| 6 -- src/qemu/qemu_domain.h| 2 +- src/qemu/qemu_driver.c| 21 - src/qemu/qemu_migration.c | 3 ++- src/qemu/qemu_process.c | 46 -- src/qemu/qemu_process.h | 1 + 6 files changed, 48 insertions(+), 31 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 4f63c88..59b2647 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2497,7 +2497,8 @@ qemuDomainDetermineDiskChain(virQEMUDriverPtr driver, int qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, - virDomainObjPtr vm) + virDomainObjPtr vm, + int asyncJob) { qemuDomainObjPrivatePtr priv = vm->privateData; char **aliases; @@ -2505,7 +2506,8 @@ qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_DEL_EVENT)) return 0; -qemuDomainObjEnterMonitor(driver, vm); +if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0) + return -1; if (qemuMonitorGetDeviceAliases(priv->mon, &aliases) < 0) { qemuDomainObjExitMonitor(driver, vm); return -1; diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 67972b9..8736889 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -369,7 +369,7 @@ extern virDomainXMLNamespace virQEMUDriverDomainXMLNamespace; extern virDomainDefParserConfig virQEMUDriverDomainDefParserConfig; int qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, - virDomainObjPtr vm); + virDomainObjPtr vm, int asyncJob); bool qemuDomainDefCheckABIStability(virQEMUDriverPtr driver, virDomainDefPtr src, diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 33541d3..b0439d2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -1616,7 +1616,8 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn, goto cleanup; } -if (qemuProcessStart(conn, driver, vm, NULL, -1, NULL, NULL, +if (qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE, + NULL, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags) < 0) { virDomainAuditStart(vm, "booted", false); @@ -5446,7 +5447,8 @@ qemuDomainSaveImageStartVM(virConnectPtr conn, } /* Set the migration source and start it up. 
*/ -ret = qemuProcessStart(conn, driver, vm, "stdio", *fd, path, NULL, +ret = qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE, + "stdio", *fd, path, NULL, VIR_NETDEV_VPORT_PROFILE_OP_RESTORE, VIR_QEMU_PROCESS_START_PAUSED); @@ -6143,7 +6145,8 @@ qemuDomainObjStart(virConnectPtr conn, } } -ret = qemuProcessStart(conn, driver, vm, NULL, -1, NULL, NULL, +ret = qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE, + NULL, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags); virDomainAuditStart(vm, "booted", ret >= 0); if (ret >= 0) { @@ -6500,7 +6503,7 @@ qemuDomainAttachDeviceLive(virDomainObjPtr vm, } if (ret == 0) -qemuDomainUpdateDeviceList(driver, vm); +qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE); return ret; } @@ -6560,7 +6563,7 @@ qemuDomainDetachDeviceLive(virDomainObjPtr vm, } if (ret == 0) -qemuDomainUpdateDeviceList(driver, vm); +qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE); return ret; } @@ -14101,8 +14104,8 @@ static int qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot, if (config) virDomainObjAssignDef(vm, config, false, NULL); -rc = qemuProcessStart(snapshot->domain->conn, - driver, vm, NULL, -1, NULL, snap, +rc = qemuProcessStart(snapshot->domain->conn, driver, vm, + QEMU_ASYNC_JOB_NONE, N
Re: [libvirt] [PATCH 1/1] qemu: Tidy up job handling during live migration
On 02/08/14 00:04, Ján Tomko wrote: > On 08/01/2014 03:12 AM, Sam Bobroff wrote: >> On 30/07/14 01:52, Ján Tomko wrote: >>> On 07/29/2014 02:41 AM, Sam Bobroff wrote: >>>> During a QEMU live migration several warning messages about job >>>> handling could be written to syslog on the destination host: >>>> >>>> "entering monitor without asking for a nested job is dangerous" >>>> >>>> The messages are written because the job handling during migration >>>> uses hard coded asyncJob values in several places that are incorrect. >>>> >>>> This patch passes the required asyncJob value around and prevents >>>> the warnings as well as any issues that the warnings may be referring >>>> to. >>>> >>>> Signed-off-by: Sam Bobroff >>>> -- >>> >>> This patch seems to fix the deadlock that can happen if the migrated domain >>> is >>> destroyed at the destination reported here: >>> https://www.redhat.com/archives/libvir-list/2014-May/msg00236.html >>> >>> It looks good to me, but it seems there are more functions calling >>> qemuDomainObjEnterMonitor that can be called from qemuProcessStart, >>> for example qemuDomainChangeGraphicsPasswords. >> >> Yes, I was fairly sure there would be other cases; I just fixed all the >> ones that actually occurred during my tests. >> >> In fact it seems like for the cases I'm looking at here, where it's the >> async job owner thread that's using the EnterMonitor functions, passing >> asyncJob around is a waste of time anyway because we know the correct >> value of asyncJob to use: it's stored in priv->job.asyncJob. >> >> Why not have qemuDomainObjEnterMonitorInternal() automatically switch to >> creating a nested job in this case? >> >> It seems easy to do and would simplify some code as well; what do you think? > > We've had it that way before - it didn't work that well: > http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=193cd0f3 Interesting. I'll stick to the simpler fix for now then. >> >>>> @@ -2505,7 +2506,7 @@ qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, >>>> if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_DEL_EVENT)) >>>> return 0; >>>> >>>> -qemuDomainObjEnterMonitor(driver, vm); >>>> +ignore_value(qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob)); >>> >>> Also, the return value of this call could be dangerous to ignore if asyncJob >>> != NONE. >> >> True, but the patch hasn't introduced this, and the full story is even >> worse ;-) >> >> void qemuDomainObjEnterMonitor(virQEMUDriverPtr driver, >>virDomainObjPtr obj) >> { >> ignore_value(qemuDomainObjEnterMonitorInternal(driver, obj, >>QEMU_ASYNC_JOB_NONE)); >> } >> > > qemuDomainObjEnterMonitorInternal is called with QEMU_ASYNC_JOB_NONE here. It > always returns 0 in that case and it's safe to ignore. The problem is when you > use other async jobs: > > static int > qemuDomainObjEnterMonitorInternal(virQEMUDriverPtr driver, > virDomainObjPtr obj, > qemuDomainAsyncJob asyncJob) > { > qemuDomainObjPrivatePtr priv = obj->privateData; > > if (asyncJob != QEMU_ASYNC_JOB_NONE) { > int ret; > if ((ret = qemuDomainObjBeginNestedJob(driver, obj, asyncJob)) < 0) > return ret; > if (!virDomainObjIsActive(obj)) { > virReportError(VIR_ERR_OPERATION_FAILED, "%s", >_("domain is no longer running")); > /* Still referenced by the containing async job. */ > ignore_value(qemuDomainObjEndJob(driver, obj)); > return -1; > } > } > ... Ah thanks, I see what you mean. 
I'll post an updated version of my patch that handles error results and I'll include the qemuDomainChangeGraphicsPasswords() (via qemuProcessInitPasswords()) function you mentioned above as well. I'd be happy to update other functions as well, but I can't see a simple way of finding every way that qemuProcessStart() might end up calling qemuDomainObjEnterMonitorInternal(). If you can recommend one I'll handle them as well, but otherwise would you accept the patch with only the known ones fixed? (It would at least be an improvement and would make it easy to fix the other cases as they are found.) > Jan > Cheers, Sam. -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 1/1] qemu: Tidy up job handling during live migration
On 30/07/14 01:52, Ján Tomko wrote: > On 07/29/2014 02:41 AM, Sam Bobroff wrote: >> During a QEMU live migration several warning messages about job >> handling could be written to syslog on the destination host: >> >> "entering monitor without asking for a nested job is dangerous" >> >> The messages are written because the job handling during migration >> uses hard coded asyncJob values in several places that are incorrect. >> >> This patch passes the required asyncJob value around and prevents >> the warnings as well as any issues that the warnings may be referring >> to. >> >> Signed-off-by: Sam Bobroff >> -- > > This patch seems to fix the deadlock that can happen if the migrated domain is > destroyed at the destination reported here: > https://www.redhat.com/archives/libvir-list/2014-May/msg00236.html > > It looks good to me, but it seems there are more functions calling > qemuDomainObjEnterMonitor that can be called from qemuProcessStart, > for example qemuDomainChangeGraphicsPasswords. Yes, I was fairly sure there would be other cases; I just fixed all the ones that actually occurred during my tests. In fact it seems like for the cases I'm looking at here, where it's the async job owner thread that's using the EnterMonitor functions, passing asyncJob around is a waste of time anyway because we know the correct value of asyncJob to use: it's stored in priv->job.asyncJob. Why not have qemuDomainObjEnterMonitorInternal() automatically switch to creating a nested job in this case? It seems easy to do and would simplify some code as well; what do you think? >> @@ -2505,7 +2506,7 @@ qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, >> if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_DEL_EVENT)) >> return 0; >> >> -qemuDomainObjEnterMonitor(driver, vm); >> +ignore_value(qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob)); > > Also, the return value of this call could be dangerous to ignore if asyncJob > != NONE. True, but the patch hasn't introduced this, and the full story is even worse ;-) void qemuDomainObjEnterMonitor(virQEMUDriverPtr driver, virDomainObjPtr obj) { ignore_value(qemuDomainObjEnterMonitorInternal(driver, obj, QEMU_ASYNC_JOB_NONE)); } > > Jan Cheers, Sam. -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCH 1/1] qemu: Tidy up job handling during live migration
During a QEMU live migration several warning messages about job handling
could be written to syslog on the destination host:

"entering monitor without asking for a nested job is dangerous"

The messages are written because the job handling during migration uses
hard coded asyncJob values in several places that are incorrect.

This patch passes the required asyncJob value around and prevents the
warnings as well as any issues that the warnings may be referring to.

Signed-off-by: Sam Bobroff
---
 src/qemu/qemu_domain.c    |  5 +++--
 src/qemu/qemu_domain.h    |  2 +-
 src/qemu/qemu_driver.c    | 21 -
 src/qemu/qemu_migration.c |  3 ++-
 src/qemu/qemu_process.c   | 33 ++---
 src/qemu/qemu_process.h   |  1 +
 6 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 4f63c88..3abbb14 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -2497,7 +2497,8 @@ qemuDomainDetermineDiskChain(virQEMUDriverPtr driver,
 
 int
 qemuDomainUpdateDeviceList(virQEMUDriverPtr driver,
-                           virDomainObjPtr vm)
+                           virDomainObjPtr vm,
+                           int asyncJob)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
     char **aliases;
@@ -2505,7 +2506,7 @@ qemuDomainUpdateDeviceList(virQEMUDriverPtr driver,
     if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_DEL_EVENT))
         return 0;
 
-    qemuDomainObjEnterMonitor(driver, vm);
+    ignore_value(qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob));
     if (qemuMonitorGetDeviceAliases(priv->mon, &aliases) < 0) {
         qemuDomainObjExitMonitor(driver, vm);
         return -1;
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 67972b9..8736889 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -369,7 +369,7 @@ extern virDomainXMLNamespace virQEMUDriverDomainXMLNamespace;
 extern virDomainDefParserConfig virQEMUDriverDomainDefParserConfig;
 
 int qemuDomainUpdateDeviceList(virQEMUDriverPtr driver,
-                               virDomainObjPtr vm);
+                               virDomainObjPtr vm, int asyncJob);
 
 bool qemuDomainDefCheckABIStability(virQEMUDriverPtr driver,
                                     virDomainDefPtr src,
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 33541d3..b0439d2 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1616,7 +1616,8 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn,
         goto cleanup;
     }
 
-    if (qemuProcessStart(conn, driver, vm, NULL, -1, NULL, NULL,
+    if (qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE,
+                         NULL, -1, NULL, NULL,
                          VIR_NETDEV_VPORT_PROFILE_OP_CREATE,
                          start_flags) < 0) {
         virDomainAuditStart(vm, "booted", false);
@@ -5446,7 +5447,8 @@ qemuDomainSaveImageStartVM(virConnectPtr conn,
     }
 
     /* Set the migration source and start it up.
      */
-    ret = qemuProcessStart(conn, driver, vm, "stdio", *fd, path, NULL,
+    ret = qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE,
+                           "stdio", *fd, path, NULL,
                            VIR_NETDEV_VPORT_PROFILE_OP_RESTORE,
                            VIR_QEMU_PROCESS_START_PAUSED);
@@ -6143,7 +6145,8 @@ qemuDomainObjStart(virConnectPtr conn,
         }
     }
 
-    ret = qemuProcessStart(conn, driver, vm, NULL, -1, NULL, NULL,
+    ret = qemuProcessStart(conn, driver, vm, QEMU_ASYNC_JOB_NONE,
+                           NULL, -1, NULL, NULL,
                            VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags);
     virDomainAuditStart(vm, "booted", ret >= 0);
     if (ret >= 0) {
@@ -6500,7 +6503,7 @@ qemuDomainAttachDeviceLive(virDomainObjPtr vm,
     }
 
     if (ret == 0)
-        qemuDomainUpdateDeviceList(driver, vm);
+        qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE);
 
     return ret;
 }
@@ -6560,7 +6563,7 @@ qemuDomainDetachDeviceLive(virDomainObjPtr vm,
     }
 
     if (ret == 0)
-        qemuDomainUpdateDeviceList(driver, vm);
+        qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE);
 
     return ret;
 }
@@ -14101,8 +14104,8 @@ static int qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
             if (config)
                 virDomainObjAssignDef(vm, config, false, NULL);
 
-            rc = qemuProcessStart(snapshot->domain->conn,
-                                  driver, vm, NULL, -1, NULL, snap,
+            rc = qemuProcessStart(snapshot->domain->conn, driver, vm,
+                                  QEMU_ASYNC_JOB_NONE, NULL, -1, NULL, snap,
                                   VIR_NETDEV_VPORT_PROFILE_OP_CREATE,
                                   VIR_QEMU_PROCESS_START_PAUSED);
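The qemu_migration.c hunk listed in the diffstat above is not included in this
excerpt, so the following is purely an assumed sketch of what the
incoming-migration caller of qemuProcessStart() might look like with the new
asyncJob parameter; the surrounding function, variable names and flags are
placeholders, not taken from the actual patch.

/* Hypothetical call site on the incoming-migration (prepare) path: start
 * the destination QEMU under the migration-in async job instead of
 * QEMU_ASYNC_JOB_NONE, so later monitor entries from this thread are
 * treated as nested jobs. */
if (qemuProcessStart(dconn, driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN,
                     migrateFrom, dataFD[0], NULL, NULL,
                     VIR_NETDEV_VPORT_PROFILE_OP_MIGRATE_IN_START,
                     VIR_QEMU_PROCESS_START_PAUSED) < 0)
    goto cleanup;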
Re: [libvirt] Syslog "nested job is dangerous" message
On 17/07/14 17:54, Ján Tomko wrote:
> On 07/17/2014 08:51 AM, Jiri Denemark wrote:
>> On Thu, Jul 17, 2014 at 13:29:52 +1000, Sam Bobroff wrote:
>>> On 17/07/14 12:50, Eric Blake wrote:
>>>> On 07/16/2014 07:52 PM, Sam Bobroff wrote:
>>>>> Hello everyone,
>>>>
>>>> [Can you configure your mailer to wrap long lines?]
>>>
>>> [No problem, done.]
>>>
>>>>> After performing a migration, the syslog often contains several
>>>>> messages like this:
>>>>>
>>>>> "This thread seems to be the async job owner; entering monitor
>>>>> without asking for a nested job is dangerous"
>>>>
>>>> The sign of a bug that we need to fix. What version of libvirt are
>>>> you using? We may have already fixed it.
>>>
>>> I've been testing on 1.2.6, but I will re-test on the latest code from git.
>>
>> Hmm, what were you doing when the message appeared in the log? I
>> fixed wrong usage of our job tracking APIs a long time ago
>> (e4e28220b572cf4c71d07740c24150411f0704a6 in 1.0.3) so I guess there
>> must be another place we don't do it correctly.
>
> That commit was essentially reverted by my commit 9846402 when fixing
> migration crashes: https://bugzilla.redhat.com/show_bug.cgi?id=1018267
>
> I think the message was harmless during migration as we don't allow other
> jobs (except destroy).
>
> Jan

I've re-tested this on the latest libvirt from git (1.2.7) and I still see
several occurrences of the "dangerous" message on the destination machine
after performing a live migration with --copy-storage-all.

It seems that qemuDomainObjEnterMonitorInternal() expects the asyncJob
parameter to carry the value of the currently running async job, but in the
cases that trigger the warning it is being called via
qemuDomainObjEnterMonitor(), which hard-codes it to QEMU_ASYNC_JOB_NONE.

It looks like these calls should be changed to qemuDomainObjEnterMonitorAsync()
with the right asyncJob parameter, which I believe is
QEMU_ASYNC_JOB_MIGRATION_IN for the cases I'm looking at (and the old value of
QEMU_ASYNC_JOB_NONE for all other callers). There aren't many cases of this, so
fixing it this way is fairly simple, but it does involve passing asyncJob
through some intermediate functions.

Another approach would be to change qemuDomainObjEnterMonitorInternal() so that
it can automatically do the same thing: it could see that there is an async job
running, and that it is owned by the current thread, and then call
qemuDomainObjBeginNestedJob() (or call qemuDomainObjBeginJobInternal() directly,
since we know that the checks in qemuDomainObjBeginNestedJob() are going to
succeed). This might also simplify some other code that is doing pretty much
what I described above.

The second approach does fix all the cases I'm looking at, and would also fix
other similar cases I haven't yet found, but I'm not sure whether it is
entirely safe or whether it circumvents some sort of safety check.

Any comments? Which approach should I pursue?

Cheers,
Sam.
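To make the first option above concrete, the change at each problem call site
would be along these lines; this is only a sketch based on the function names
discussed in this thread, not the patch that was later posted.

/* Before: hard-codes "no async job", which triggers the warning when the
 * calling thread is the owner of the migration-in async job. */
qemuDomainObjEnterMonitor(driver, vm);

/* After: the caller states which async job it is running under, e.g.
 * QEMU_ASYNC_JOB_MIGRATION_IN on the incoming side of a migration, and
 * checks the result since entering the monitor can now fail. */
if (qemuDomainObjEnterMonitorAsync(driver, vm,
                                   QEMU_ASYNC_JOB_MIGRATION_IN) < 0)
    goto error;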
Re: [libvirt] Syslog "nested job is dangerous" message
On 17/07/14 12:50, Eric Blake wrote:
> On 07/16/2014 07:52 PM, Sam Bobroff wrote:
>> Hello everyone,
>
> [Can you configure your mailer to wrap long lines?]

[No problem, done.]

>> After performing a migration, the syslog often contains several
>> messages like this:
>>
>> "This thread seems to be the async job owner; entering monitor
>> without asking for a nested job is dangerous"
>
> The sign of a bug that we need to fix. What version of libvirt are
> you using? We may have already fixed it.

I've been testing on 1.2.6, but I will re-test on the latest code from git.

>> The message makes it sound as if the user has done something
>> dangerous but, after looking at the code that produces the message,
>> it appears to be more to do with how the libvirt job code is used.
>
> Rather, libvirt itself has done something dangerous.
>
>> The commit that added the warning
>> (6eede368bc8e3df2c94c2ec1a913885ce08e99db) doesn't explain why it
>> might be dangerous...
>
> Jiri can probably explain better, but it means there is a race
> condition where libvirt can lose track of a long-running job and
> cause memory corruption in its management of the guest.
>
>> Would it be appropriate to re-word the message, or perhaps change
>> it to a debug message so that it's not normally seen by users?
>
> No.
>
>> Is it important to track down the cases that are generating the
>> warning and fix them? (Could it cause some kind of significant
>> problem?)
>
> Yes.

Acknowledged. If they still occur on the latest code I'll start tracking
them down :-)

Cheers,
Sam.
[libvirt] Syslog "nested job is dangerous" message
Hello everyone,

After performing a migration, the syslog often contains several messages
like this:

"This thread seems to be the async job owner; entering monitor without
asking for a nested job is dangerous"

The message makes it sound as if the user has done something dangerous but,
after looking at the code that produces the message, it appears to be more
to do with how the libvirt job code is used.

The commit that added the warning (6eede368bc8e3df2c94c2ec1a913885ce08e99db)
doesn't explain why it might be dangerous...

    qemu: Warn on possibly incorrect usage of EnterMonitor*

    qemuDomainObjEnterMonitor{,WithDriver} should not be called from async
    jobs, only EnterMonitorAsync variant is allowed.

... so I would appreciate some advice from people who understand that area.

Would it be appropriate to re-word the message, or perhaps change it to a
debug message so that it's not normally seen by users?

Is it important to track down the cases that are generating the warning and
fix them? (Could it cause some kind of significant problem?)

Cheers,
Sam.