Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-13 Thread Daniel P. Berrange
On Wed, Oct 06, 2010 at 05:56:36PM -0400, Stefan Berger wrote:
  V2:
   remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild
 
  This is from a bug report and conversation on IRC where Soren reported 
 that while a filter update is occurring on one or more VMs (due to a 
 rule having been edited for example), a deadlock can occur when a VM 
 referencing a filter is started.
 
 The problem is caused by the two locking sequences of
 
 qemu driver, qemu domain, filter  # for the VM start operation
 filter, qemu_driver, qemu_domain  # for the filter update operation
 
 that obviously don't lock in the same order. The problem is the 2nd lock 
 sequence. Here the qemu_driver lock is being grabbed in 
 qemu_driver:qemudVMFilterRebuild()
 
 The following solution is based on the idea of trying to re-arrange the 
 2nd sequence of locks as follows:
 
 qemu_driver, filter, qemu_driver, qemu_domain
 
 and making the qemu driver recursively lockable so that a second lock 
 can occur, this would then lead to the following net-locking sequence
 
 qemu_driver, filter, qemu_domain
 
 where the 2nd qemu_driver lock has been (logically) eliminated.
 
 The 2nd part of the idea is that the sequence of locks (filter,
 qemu_domain) and (qemu_domain, filter) becomes interchangeable if all
 code paths where filter AND qemu_domain are locked have a preceding
 qemu_driver lock that basically blocks their concurrent execution.

I'm not entirely happy with this locking scheme, because the
broken lock ordering problem is still present. We're just 
relying on the fact that we retain the global lock to protect
us against the ongoing lock ordering issue.

That said, I don't see any immediately obvious alternative
to solve this problem. So I guess I'll ack this approach
for now. One day this will likely need re-visiting...

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org       -o-        http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-13 Thread Daniel P. Berrange
On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:
  On 10/07/2010 09:06 AM, Soren Hansen wrote:
 I had trouble applying the patch (I think Thunderbird may have
 fiddled with the formatting :( ), but after applying it manually, it works
 excellently. Thanks!
 
 Great. I will prepare a V3.
 
 I am also shooting a kill -SIGHUP at libvirt once in a while to see what 
 happens (while creating / destroying 2 VMs and modifying their filters). 
 Most of the time all goes well, but occasionally things do get stuck. I 
 get the following debugging output from libvirt and, attaching gdb to
 libvirt, I see the following stack traces. Maybe Daniel can interpret
 this... To me it looks like some of the conditions need to be 'tickled'...
 
 
 09:47:25.000: error : qemuAutostartDomain:822 : Failed to start job on 
 VM 'dummy-vm1': Timed out during operation: cannot acquire state change lock
 
 (gdb) thr ap all bt
 
 Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
 #0  0x00327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
 #1  0x00435312 in virCondWaitUntil (c=<value optimized out>,
 m=<value optimized out>, whenms=<value optimized out>)
 at util/threads-pthread.c:115
 #2  0x0043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010,
 obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
 #3  0x00458abf in qemuAutostartDomain (payload=<value optimized out>,
 name=<value optimized out>, opaque=0x7f49bf591320)
 at qemu/qemu_driver.c:818
 #4  0x7f49c040ab6a in virHashForEach (table=0x1f9be20,
 iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
 at util/hash.c:495
 #5  0x0043cdac in qemudAutostartConfigs (driver=0x1f9c010)
 at qemu/qemu_driver.c:855
 #6  0x0043ce2a in qemudReload () at qemu/qemu_driver.c:2003
 #7  0x7f49c0450a3e in virStateReload () at libvirt.c:1017
 #8  0x004189e1 in qemudDispatchSignalEvent (
 watch=<value optimized out>, fd=<value optimized out>,
 events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
 ---Type <return> to continue, or q <return> to quit---
 #9  0x004186a9 in virEventDispatchHandles () at event.c:479
 #10 virEventRunOnce () at event.c:608
 #11 0x0041a346 in qemudOneLoop () at libvirtd.c:2217
 #12 0x0041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
 #13 0x003276807761 in start_thread () from /lib64/libpthread.so.0
 #14 0x0032760e14ed in clone () from /lib64/libc.so.6


This thread shows the problem. Guests must not be run directly
from the event loop thread, because startup requires waiting
for I/O events. So this thread is sitting on the condition
variable waiting for an I/O event to complete, but because
it's doing this from the event loop thread, the event loop
isn't running. So the condition will never be signalled.
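
A minimal sketch of that self-blocking pattern, with hypothetical
names (in the trace above it is qemuDomainObjBeginJobWithDriver
sitting in virCondWaitUntil):

  #include <pthread.h>
  #include <stdbool.h>

  static pthread_mutex_t jobLock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  jobCond = PTHREAD_COND_INITIALIZER;
  static bool jobDone;

  /* Called (indirectly) from the SIGHUP handler, i.e. on the event
   * loop thread itself. */
  static void startGuest(void)
  {
      pthread_mutex_lock(&jobLock);
      while (!jobDone)
          /* jobDone is only ever set by an I/O callback that the
           * event loop would dispatch -- but this thread IS the
           * event loop, so the wait never completes (the real code
           * uses a timed wait and eventually fails with "cannot
           * acquire state change lock"). */
          pthread_cond_wait(&jobCond, &jobLock);
      pthread_mutex_unlock(&jobLock);
  }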

This is completely unrelated to the other problems discussed
in this thread & I'm surprised we've not seen it before now!

When you send SIGHUP to libvirt this triggers a reload of the
guest domain configs. For some reason we also have this SIGHUP
re-triggering autostart. IMHO this is a very big mistake. If
a guest is marked as autostart, I don't think an admin would
expect it to be started when just sending SIGHUP. I think we
should fix it so that autostart is only ever done at daemon
startup, not SIGHUP. This would avoid the entire problem code
path here.
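
A sketch of the suggested change, with simplified signatures
(qemudReload and qemudAutostartConfigs appear in the trace above;
the reload helper is hypothetical):

  struct qemud_driver;
  extern struct qemud_driver *qemu_driver;
  void qemudAutostartConfigs(struct qemud_driver *driver);
  void qemudReloadConfigs(struct qemud_driver *driver);  /* hypothetical */

  static int qemudStartup(void)
  {
      /* ... locate and load configs ... */
      qemudAutostartConfigs(qemu_driver);   /* autostart only here */
      return 0;
  }

  static int qemudReload(void)
  {
      /* reload guest configs, but do NOT re-run autostart from
       * the event loop thread */
      qemudReloadConfigs(qemu_driver);
      return 0;
  }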

Regards,
Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org       -o-        http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-13 Thread Stefan Berger

 On 10/13/2010 09:07 AM, Daniel P. Berrange wrote:

On Wed, Oct 06, 2010 at 05:56:36PM -0400, Stefan Berger wrote:

  V2:
  remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild

  This is from a bug report and conversation on IRC where Soren reported
that while a filter update is occurring on one or more VMs (due to a
rule having been edited for example), a deadlock can occur when a VM
referencing a filter is started.

The problem is caused by the two locking sequences of

qemu driver, qemu domain, filter  # for the VM start operation
filter, qemu_driver, qemu_domain  # for the filter update operation

that obviously don't lock in the same order. The problem is the 2nd lock
sequence. Here the qemu_driver lock is being grabbed in
qemu_driver:qemudVMFilterRebuild()

The following solution is based on the idea of trying to re-arrange the
2nd sequence of locks as follows:

qemu_driver, filter, qemu_driver, qemu_domain

and making the qemu driver recursively lockable so that a second lock
can occur, this would then lead to the following net-locking sequence

qemu_driver, filter, qemu_domain

where the 2nd qemu_driver lock has been (logically) eliminated.

The 2nd part of the idea is that the sequence of locks (filter,
qemu_domain) and (qemu_domain, filter) becomes interchangeable if all
code paths where filter AND qemu_domain are locked have a preceding
qemu_driver lock that basically blocks their concurrent execution.

I'm not entirely happy with this locking scheme, because the
broken lock ordering problem is still present. We're just
relying on the fact that we retain the global lock to protect
us against the ongoing lock ordering issue.

That said, I don't see any immediately obvious alternative
to solve this problem. So I guess I'll ack this approach
for now. One day this will likely need re-visiting...


Ok, I'll push it.

 Stefan

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-13 Thread Stefan Berger

 On 10/13/2010 09:11 AM, Daniel P. Berrange wrote:

On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:

  On 10/07/2010 09:06 AM, Soren Hansen wrote:

I had trouble applying the patch (I think Thunderbird may have
fiddled with the formatting :( ), but after applying it manually, it works
excellently. Thanks!


Great. I will prepare a V3.

I am also shooting a kill -SIGHUP at libvirt once in a while to see what
happens (while creating / destroying 2 VMs and modifying their filters).
Most of the time all goes well, but occasionally things do get stuck. I
get the following debugging output from libvirt and, attaching gdb to
libvirt, I see the following stack traces. Maybe Daniel can interpret
this... To me it looks like some of the conditions need to be 'tickled'...

(gdb) thr ap all bt

Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
#0  0x00327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1  0x00435312 in virCondWaitUntil (c=<value optimized out>,
 m=<value optimized out>, whenms=<value optimized out>)
 at util/threads-pthread.c:115
#2  0x0043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010,
 obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
#3  0x00458abf in qemuAutostartDomain (payload=<value optimized out>,
 name=<value optimized out>, opaque=0x7f49bf591320)
 at qemu/qemu_driver.c:818
#4  0x7f49c040ab6a in virHashForEach (table=0x1f9be20,
 iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
 at util/hash.c:495
#5  0x0043cdac in qemudAutostartConfigs (driver=0x1f9c010)
 at qemu/qemu_driver.c:855
#6  0x0043ce2a in qemudReload () at qemu/qemu_driver.c:2003
#7  0x7f49c0450a3e in virStateReload () at libvirt.c:1017
#8  0x004189e1 in qemudDispatchSignalEvent (
 watch=<value optimized out>, fd=<value optimized out>,
 events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
---Type <return> to continue, or q <return> to quit---
#9  0x004186a9 in virEventDispatchHandles () at event.c:479
#10 virEventRunOnce () at event.c:608
#11 0x0041a346 in qemudOneLoop () at libvirtd.c:2217
#12 0x0041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
#13 0x003276807761 in start_thread () from /lib64/libpthread.so.0
#14 0x0032760e14ed in clone () from /lib64/libc.so.6


This thread shows the problem. Guests must not be run directly
from the event loop thread, because startup requires waiting
for I/O events. So this thread is sitting on the condition
variable waiting for an I/O event to complete, but because
it's doing this from the event loop thread, the event loop
isn't running. So the condition will never be signalled.
This is completely unrelated to the other problems discussed
in this thread & I'm surprised we've not seen it before now!

Yes, it's unrelated and came up through my testing of the code paths I 
touched with the deadlock-prevention patch...

When you send SIGHUP to libvirt this triggers a reload of the
guest domain configs. For some reason we also have this SIGHUP
re-triggering autostart. IMHO this is a very big mistake. If
a guest is marked as autostart, I don't think an admin would
expect it to be started when just sending SIGHUP. I think we
should fix it so that autostart is only ever done at daemon
startup, not SIGHUP. This would avoid the entire problem code
path here.

FWIW, I don't have any VM marked as 'autostart', but the code seems to
be doing something 'for' VMs no matter whether they are marked as
autostart or not, i.e., it runs 'qemuDomainObjBeginJobWithDriver'.


   Stefan

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-07 Thread Soren Hansen
I had trouble applying the patch (I think Thunderbird may have
fiddled with the formatting :( ), but after applying it manually, it works
excellently. Thanks!

-- 
Soren Hansen
Ubuntu Developer    http://www.ubuntu.com/
OpenStack Developer http://www.openstack.org/

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-07 Thread Daniel Veillard
On Wed, Oct 06, 2010 at 05:56:36PM -0400, Stefan Berger wrote:
  V2:
   remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild
[...]
 So, the following code paths exist towards
 qemu_driver:qemudVMFilterRebuild where we now want to put a
 qemu_driver lock in front of the filter lock.
 
 - nwfilterUndefine()   [ locks the filter ]
   - virNWFilterTestUnassignDef()
     - virNWFilterTriggerVMFilterRebuild()
       - qemudVMFilterRebuild()
 
 - nwfilterDefine()
   - virNWFilterPoolAssignDef() [ locks the filter ]
     - virNWFilterTriggerVMFilterRebuild()
       - qemudVMFilterRebuild()
 
 - nwfilterDriverReload()
   - virNWFilterPoolLoadAllConfigs()
     - virNWFilterPoolObjLoad()
       - virNWFilterPoolAssignDef() [ locks the filter ]
         - virNWFilterTriggerVMFilterRebuild()
           - qemudVMFilterRebuild()
 
 - nwfilterDriverStartup()
   - virNWFilterPoolLoadAllConfigs()
     - virNWFilterPoolObjLoad()
       - virNWFilterPoolAssignDef() [ locks the filter ]
         - virNWFilterTriggerVMFilterRebuild()
           - qemudVMFilterRebuild()
 
 Qemu is not the only driver using the nwfilter driver; the
 UML driver also calls into it. Therefore qemudVMFilterRebuild() can be
 exchanged with umlVMFilterRebuild(), along with the driver lock of
 qemu_driver, which can now be a uml_driver. Further, since UML and
 Qemu domains can be running on the same machine, the triggering of a
 rebuild of the filter can touch both types of drivers and their
 domains.

  okay

 In the patch below I am now extending each nwfilter callback driver
 with functions for locking and unlocking the (VM) driver (UML, QEMU)
 and introducing new functions for locking all registered callback
 drivers and unlocking them. Then I am distributing the
 lock-all-cbdrivers/unlock-all-cbdrivers calls into the above call
 paths. The last callpath shown, starting with nwfilterDriverStartup(),
 is problematic since the nwfilter driver is initialized before the
 QEMU and UML drivers are, and thus a lock in that path would be an
 attempt to lock a NULL pointer -- the call to
 virNWFilterTriggerVMFilterRebuild() is never made there, so we never
 lock either the qemu_driver or the uml_driver in that path.
 Therefore, only the first 3 paths now receive calls to lock and
 unlock all callback drivers. Now that the locks are distributed
 where it matters, I can remove the qemu_driver and uml_driver locks
 from qemudVMFilterRebuild() and umlVMFilterRebuild(), no longer
 requiring the recursive locks.

  okay,

 For now I want to put this out as an RFC patch. I have tested it by
 'stretching' the critical section after the define/undefine
 functions each lock the filter so I can (easily) concurrently
 execute another VM operation (suspend, start). That code is in this
 patch and if you want you can de-activate it. It seems to work ok
 and operations are being blocked while the update is being done.
 I still also want to verify the other assumption above that locking
 filter and qemu_domain always has a preceding qemu_driver lock.

  I looked at the patch and the only thing that needs to be cleaned up
is the stretching block below; otherwise it looks fine to me.

 
 +if (true) {
 +    fprintf(stderr, "sleep in %s\n", __FUNCTION__);
 +    sleep(CRITICAL_SECTION_STRETCH);
 +    fprintf(stderr, "continue in %s\n", __FUNCTION__);
 +}
 +

  It would be good to have some ways to exercise all code paths
in the locking (OK, error-handling-specific paths aside, because
that's really hard) before applying.

Daniel

-- 
Daniel Veillard  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-07 Thread Stefan Berger

 On 10/07/2010 09:06 AM, Soren Hansen wrote:

I had trouble applying the patch (I think Thunderbird may have
fiddled with the formatting :( ), but after applying it manually, it works
excellently. Thanks!


Great. I will prepare a V3.

I am also shooting a kill -SIGHUP at libvirt once in a while to see what 
happens (while creating / destroying 2 VMs and modifying their filters). 
Most of the time all goes well, but occasionally things do get stuck. I 
get the following debugging output from libvirt and, attaching gdb to
libvirt, I see the following stack traces. Maybe Daniel can interpret
this... To me it looks like some of the conditions need to be 'tickled'...



09:47:25.000: error : qemuAutostartDomain:822 : Failed to start job on 
VM 'dummy-vm1': Timed out during operation: cannot acquire state change lock


(gdb) thr ap all bt

Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
#0  0x00327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00435312 in virCondWaitUntil (c=<value optimized out>,
m=<value optimized out>, whenms=<value optimized out>)
at util/threads-pthread.c:115
#2  0x0043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010,
obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
#3  0x00458abf in qemuAutostartDomain (payload=<value optimized out>,
name=<value optimized out>, opaque=0x7f49bf591320)
at qemu/qemu_driver.c:818
#4  0x7f49c040ab6a in virHashForEach (table=0x1f9be20,
iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
at util/hash.c:495
#5  0x0043cdac in qemudAutostartConfigs (driver=0x1f9c010)
at qemu/qemu_driver.c:855
#6  0x0043ce2a in qemudReload () at qemu/qemu_driver.c:2003
#7  0x7f49c0450a3e in virStateReload () at libvirt.c:1017
#8  0x004189e1 in qemudDispatchSignalEvent (
watch=<value optimized out>, fd=<value optimized out>,
events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
---Type <return> to continue, or q <return> to quit---
#9  0x004186a9 in virEventDispatchHandles () at event.c:479
#10 virEventRunOnce () at event.c:608
#11 0x0041a346 in qemudOneLoop () at libvirtd.c:2217
#12 0x0041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
#13 0x003276807761 in start_thread () from /lib64/libpthread.so.0
#14 0x0032760e14ed in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f49beb91710 (LWP 17465)):
#0  0x00327680b3bc in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00435336 in virCondWait (c=<value optimized out>,
m=<value optimized out>) at util/threads-pthread.c:100
#2  0x0041b2e5 in qemudWorker (data=0x7f49b80008c0) at libvirtd.c:1549

#3  0x003276807761 in start_thread () from /lib64/libpthread.so.0
#4  0x0032760e14ed in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f49be190710 (LWP 17466)):
#0  0x00327680b3bc in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00435336 in virCondWait (c=<value optimized out>,
m=<value optimized out>) at util/threads-pthread.c:100
#2  0x004716f6 in qemuMonitorSend (mon=0x7f49a00040b0,
---Type <return> to continue, or q <return> to quit---
msg=<value optimized out>) at qemu/qemu_monitor.c:726
#3  0x00473673 in qemuMonitorCommandWithHandler (mon=0x7f49a00040b0,
cmd=0x4c316f "info cpus", passwordHandler=0, passwordOpaque=0x0,
scm_fd=-1, reply=0x7f49be18e400) at qemu/qemu_monitor_text.c:233
#4  0x0047845e in qemuMonitorTextGetCPUInfo (
mon=<value optimized out>, pids=0x7f49be18fa18)
at qemu/qemu_monitor_text.c:401
#5  0x00455db2 in qemuDetectVcpuPIDs (conn=0x7f49ac036100,
driver=0x1f9c010, vm=<value optimized out>, migrateFrom=0x0,
start_paused=false, stdin_fd=-1, stdin_path=0x0) at qemu/qemu_driver.c:2413
#6  qemudStartVMDaemon (conn=0x7f49ac036100, driver=0x1f9c010,
vm=<value optimized out>, migrateFrom=0x0, start_paused=false,
stdin_fd=-1, stdin_path=0x0) at qemu/qemu_driver.c:3925
#7  0x00457f6b in qemudDomainCreate (conn=0x7f49ac036100,
xml=<value optimized out>, flags=0) at qemu/qemu_driver.c:4559
#8  0x7f49c04564ab in virDomainCreateXML (conn=0x7f49ac036100,
xmlDesc=0x7f49a02b8bf0 "<domain type='kvm'>\n  <name>dummy-vm2</name>\n  <memory>32768</memory>\n  <currentMemory>32768</currentMemory>\n  <vcpu>1</vcpu>\n  <os>\n    <type>hvm</type>\n    <boot dev='hd'/>\n  </os>"..., flags=0) at libvirt.c:1984
#9  0x00427da8 in remoteDispatchDomainCreateXml (
server=<value optimized out>, client=<value optimized out>,
---Type <return> to continue, or q <return> to quit---
conn=0x7f49ac036100, hdr=<value optimized out>, rerr=0x7f49be18fc70,
args=<value optimized out>, ret=0x7f49be18fbc0) at remote.c:1271
#10 0x0042a657 in remoteDispatchClientCall (server=0x1f6f830,
client=0x7f49b83f03c0, msg=0x7f49b8453800) at dispatch.c:529
#11 remoteDispatchClientRequest (server=0x1f6f830, client=0x7f49b83f03c0,

Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-07 Thread Stefan Berger

 On 10/07/2010 09:52 AM, Daniel Veillard wrote:

On Wed, Oct 06, 2010 at 05:56:36PM -0400, Stefan Berger wrote:

  V2:
   remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild



For now I want to put this out as an RFC patch. I have tested it by
'stretching' the critical section after the define/undefine
functions each lock the filter so I can (easily) concurrently
execute another VM operation (suspend, start). That code is in this
patch and if you want you can de-activate it. It seems to work ok
and operations are being blocked while the update is being done.
I still also want to verify the other assumption above that locking
filter and qemu_domain always has a preceding qemu_driver lock.

   I looked at the patch and the only thing that needs to be cleaned up
is the stretching block below; otherwise it looks fine to me.


Yes, I'll remove that.

+if (true) {
+    fprintf(stderr, "sleep in %s\n", __FUNCTION__);
+    sleep(CRITICAL_SECTION_STRETCH);
+    fprintf(stderr, "continue in %s\n", __FUNCTION__);
+}
+

   It would be good to have some ways to exercise all code paths
in the locking (OK, error-handling-specific paths aside, because
that's really hard) before applying.

It will take some time, but I'll try to write a TCK test that does 
something like Soren was testing:


- concurrently create/destroy 2 vms
- concurrently modify their filters
- plus occasionally shoot a SIGHUP at libvirt ( if possible )

That would test some of the code paths.
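
In the meantime, a minimal stand-alone stress client against the
public API could look like this (a sketch only, not the TCK test;
it assumes qemu:///system, a prepared domain XML, and libvirtd's
pid on the command line, and elides error handling):

  #include <libvirt/libvirt.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <sys/types.h>

  static const char *xml =
      "<domain type='kvm'>"
      "  <name>dummy-vm1</name>"
      "  <memory>32768</memory>"
      "  <vcpu>1</vcpu>"
      "  <os><type>hvm</type><boot dev='hd'/></os>"
      "</domain>";

  int main(int argc, char **argv)
  {
      virConnectPtr conn;
      pid_t daemon_pid;
      int i;

      if (argc < 2)
          return 1;
      daemon_pid = (pid_t) atoi(argv[1]);   /* libvirtd's pid */
      conn = virConnectOpen("qemu:///system");
      if (!conn)
          return 1;

      for (i = 0; i < 100; i++) {
          /* create and immediately destroy a transient domain */
          virDomainPtr dom = virDomainCreateXML(conn, xml, 0);
          if (dom) {
              virDomainDestroy(dom);
              virDomainFree(dom);
          }
          if (i % 10 == 0)
              kill(daemon_pid, SIGHUP);   /* occasionally poke libvirtd */
      }
      virConnectClose(conn);
      return 0;
  }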

  Stefan

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-07 Thread Daniel Veillard
On Thu, Oct 07, 2010 at 10:06:55AM -0400, Stefan Berger wrote:
  On 10/07/2010 09:52 AM, Daniel Veillard wrote:
 On Wed, Oct 06, 2010 at 05:56:36PM -0400, Stefan Berger wrote:
   V2:
    remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild
 
 For now I want to put this out as an RFC patch. I have tested it by
 'stretching' the critical section after the define/undefine
 functions each lock the filter so I can (easily) concurrently
 execute another VM operation (suspend, start). That code is in this
 patch and if you want you can de-activate it. It seems to work ok
 and operations are being blocked while the update is being done.
 I still also want to verify the other assumption above that locking
 filter and qemu_domain always has a preceding qemu_driver lock.
   I looked at the patch and the only thing that needs to be cleaned up
 is the stretching block below; otherwise it looks fine to me.
 
 Yes, I'll remove that.

 sure, it was an RFC :-)

 +if (true) {
 +    fprintf(stderr, "sleep in %s\n", __FUNCTION__);
 +    sleep(CRITICAL_SECTION_STRETCH);
 +    fprintf(stderr, "continue in %s\n", __FUNCTION__);
 +}
 +
   It would be good to have some ways to exercise all code paths
 in the locking (OK, error-handling-specific paths aside, because
 that's really hard) before applying.
 
 It will take some time, but I'll try to write a TCK test that does
 something like Soren was testing:
 
 - concurrently create/destroy 2 vms
 - concurrently modify their filters
 - plus occasionally shoot a SIGHUP at libvirt ( if possible )
 
 That would test some of the code paths.

  That would be nice, yes

Daniel

-- 
Daniel Veillard  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


[libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-06 Thread Stefan Berger
 This is from a bug report and conversation on IRC where Soren reported 
that while a filter update is occurring on one or more VMs (due to a 
rule having been edited for example), a deadlock can occur when a VM 
referencing a filter is started.


The problem is caused by the two locking sequences of

qemu driver, qemu domain, filter  # for the VM start operation
filter, qemu_driver, qemu_domain  # for the filter update operation


that obviously don't lock in the same order. The problem is the 2nd lock 
sequence. Here the qemu_driver lock is being grabbed in 
qemu_driver:qemudVMFilterRebuild()
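
A minimal sketch of the two inconsistent lock orders, with plain
pthread mutexes standing in for the real driver/domain/filter
locks (illustrative only; not libvirt's actual lock API):

  #include <pthread.h>

  static pthread_mutex_t driverLock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t domainLock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t filterLock = PTHREAD_MUTEX_INITIALIZER;

  /* VM start path: driver -> domain -> filter */
  static void *vmStart(void *arg)
  {
      pthread_mutex_lock(&driverLock);
      pthread_mutex_lock(&domainLock);
      pthread_mutex_lock(&filterLock);  /* blocks while an update holds it */
      /* ... instantiate the domain's filters ... */
      pthread_mutex_unlock(&filterLock);
      pthread_mutex_unlock(&domainLock);
      pthread_mutex_unlock(&driverLock);
      return NULL;
  }

  /* Filter update path: filter -> driver -> domain */
  static void *filterUpdate(void *arg)
  {
      pthread_mutex_lock(&filterLock);
      pthread_mutex_lock(&driverLock);  /* deadlock if vmStart holds it */
      pthread_mutex_lock(&domainLock);
      /* ... rebuild the filters of each domain ... */
      pthread_mutex_unlock(&domainLock);
      pthread_mutex_unlock(&driverLock);
      pthread_mutex_unlock(&filterLock);
      return NULL;
  }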


The following solution is based on the idea of trying to re-arrange the 
2nd sequence of locks as follows:


qemu_driver, filter, qemu_driver, qemu_domain

and making the qemu driver recursively lockable so that a second lock 
can occur, this would then lead to the following net-locking sequence


qemu_driver, filter, qemu_domain

where the 2nd qemu_driver lock has been (logically) eliminated.
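
As a sketch, a driver lock can be made recursively lockable with
pthreads like this (illustrative only; libvirt's virMutex wrappers
differ):

  #include <pthread.h>

  static pthread_mutex_t driverLock;

  static void driverLockInit(void)
  {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&driverLock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

  /* The same thread may now take driverLock twice, so
   * driver -> filter -> driver -> domain collapses to
   * driver -> filter -> domain: the inner lock succeeds
   * immediately and only bumps a recursion count. */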

The 2nd part of the idea is that the sequence of locks (filter,
qemu_domain) and (qemu_domain, filter) becomes interchangeable if all
code paths where filter AND qemu_domain are locked have a preceding
qemu_driver lock that basically blocks their concurrent execution.


So, the following code paths exist towards 
qemu_driver:qemudVMFilterRebuild where we now want to put a qemu_driver 
lock in front of the filter lock.


- nwfilterUndefine()   [ locks the filter ]
  - virNWFilterTestUnassignDef()
    - virNWFilterTriggerVMFilterRebuild()
      - qemudVMFilterRebuild()

- nwfilterDefine()
  - virNWFilterPoolAssignDef() [ locks the filter ]
    - virNWFilterTriggerVMFilterRebuild()
      - qemudVMFilterRebuild()

- nwfilterDriverReload()
  - virNWFilterPoolLoadAllConfigs()
    - virNWFilterPoolObjLoad()
      - virNWFilterPoolAssignDef() [ locks the filter ]
        - virNWFilterTriggerVMFilterRebuild()
          - qemudVMFilterRebuild()

- nwfilterDriverStartup()
  - virNWFilterPoolLoadAllConfigs()
    - virNWFilterPoolObjLoad()
      - virNWFilterPoolAssignDef() [ locks the filter ]
        - virNWFilterTriggerVMFilterRebuild()
          - qemudVMFilterRebuild()

Qemu is not the only driver using the nwfilter driver; the UML
driver also calls into it. Therefore qemudVMFilterRebuild() can be
exchanged with umlVMFilterRebuild(), along with the driver lock of
qemu_driver, which can now be a uml_driver. Further, since UML and
Qemu domains can be running on the same machine, the triggering of a
rebuild of the filter can touch both types of drivers and their
domains.


In the patch below I am now extending each nwfilter callback driver with
functions for locking and unlocking the (VM) driver (UML, QEMU) and
introducing new functions for locking all registered callback drivers
and unlocking them. Then I am distributing the
lock-all-cbdrivers/unlock-all-cbdrivers calls into the above call paths.
The last callpath shown, starting with nwfilterDriverStartup(), is
problematic since the nwfilter driver is initialized before the QEMU and
UML drivers are, and thus a lock in that path would be an attempt to
lock a NULL pointer -- the call to virNWFilterTriggerVMFilterRebuild()
is never made there, so we never lock either the qemu_driver or the
uml_driver in that path. Therefore, only the first 3 paths now receive
calls to lock and unlock all callback drivers. Now that the locks are
distributed where it matters, I can remove the qemu_driver and
uml_driver locks from qemudVMFilterRebuild() and umlVMFilterRebuild(),
no longer requiring the recursive locks.
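
A rough sketch of that callback-driver extension (identifiers are
illustrative, not necessarily the patch's exact names, and the
rebuild signature is simplified; see the patch below for the real
definitions):

  #define MAX_CB_DRIVERS 10

  typedef struct _virNWFilterCallbackDriver {
      const char *name;
      int  (*vmFilterRebuild)(void *opaque);  /* simplified */
      void (*vmDriverLock)(void);             /* new */
      void (*vmDriverUnlock)(void);           /* new */
  } virNWFilterCallbackDriver;

  static virNWFilterCallbackDriver *callbackDrvArray[MAX_CB_DRIVERS];
  static int nCallbackDriver;

  /* Take every registered VM driver lock before the filter lock,
   * so the filter-update path starts with the same lock that the
   * VM-start path takes first. */
  void virNWFilterCallbackDriversLock(void)
  {
      int i;
      for (i = 0; i < nCallbackDriver; i++)
          callbackDrvArray[i]->vmDriverLock();
  }

  void virNWFilterCallbackDriversUnlock(void)
  {
      int i;
      for (i = 0; i < nCallbackDriver; i++)
          callbackDrvArray[i]->vmDriverUnlock();
  }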


For now I want to put this out as an RFC patch. I have tested it by 
'stretching' the critical section after the define/undefine functions 
each lock the filter so I can (easily) concurrently execute another VM 
operation (suspend, start). That code is in this patch and if you want 
you can de-activate it. It seems to work ok and operations are being 
blocked while the update is being done.
I still also want to verify the other assumption above that locking 
filter and qemu_domain always has a preceding qemu_driver lock.
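
One way to check that assumption at runtime would be a debug-only
wrapper (a sketch, not part of the patch; assumes gcc's __thread):

  #include <assert.h>
  #include <pthread.h>

  static __thread int driverLockDepth;   /* per-thread lock depth */

  static void qemuDriverLockChecked(pthread_mutex_t *m)
  {
      pthread_mutex_lock(m);
      driverLockDepth++;
  }

  static void qemuDriverUnlockChecked(pthread_mutex_t *m)
  {
      driverLockDepth--;
      pthread_mutex_unlock(m);
  }

  static void filterLockChecked(pthread_mutex_t *m)
  {
      assert(driverLockDepth > 0);  /* driver lock must be held first */
      pthread_mutex_lock(m);
  }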


Signed-off-by: Stefan Berger <stef...@us.ibm.com>

---
 src/conf/nwfilter_conf.c   |   18 ++
 src/conf/nwfilter_conf.h   |6 ++
 src/libvirt_private.syms   |2 ++
 src/nwfilter/nwfilter_driver.c |   20 
 src/qemu/qemu_driver.c |   15 +++
 src/uml/uml_driver.c   |   14 ++
 6 files changed, 75 insertions(+)

Index: libvirt-acl/src/conf/nwfilter_conf.h
===
--- libvirt-acl.orig/src/conf/nwfilter_conf.h
+++ libvirt-acl/src/conf/nwfilter_conf.h
@@ -639,6 +639,8 @@ void virNWFilterConfLayerShutdown(void);

 typedef int (*virNWFilterRebuild)(virConnectPtr conn,
   virHashIterator, void 

Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-06 Thread Stefan Berger

 On 10/06/2010 12:27 PM, Stefan Berger wrote:
distributed where it matters I can remove the qemu_driver and 
uml_driver lock from qemudVMFilterRebuild() and umlVMFilterRebuild() 
and not requiring the recursive locks.


Uiui, the words are right, the code is not. I apologize. V2 coming 
shortly + my test programs.


  Stefan

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


[libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

2010-10-06 Thread Stefan Berger

 V2:
  remove the locks from qemudVMFilterRebuild & umlVMFilterRebuild

 This is from a bug report and conversation on IRC where Soren reported 
that while a filter update is occurring on one or more VMs (due to a 
rule having been edited for example), a deadlock can occur when a VM 
referencing a filter is started.


The problem is caused by the two locking sequences of

qemu driver, qemu domain, filter  # for the VM start operation
filter, qemu_driver, qemu_domain  # for the filter update operation


that obviously don't lock in the same order. The problem is the 2nd lock 
sequence. Here the qemu_driver lock is being grabbed in 
qemu_driver:qemudVMFilterRebuild()


The following solution is based on the idea of trying to re-arrange the 
2nd sequence of locks as follows:


qemu_driver, filter, qemu_driver, qemu_domain

and making the qemu driver recursively lockable so that a second lock 
can occur, this would then lead to the following net-locking sequence


qemu_driver, filter, qemu_domain

where the 2nd qemu_driver lock has been (logically) eliminated.

The 2nd part of the idea is that the sequence of locks (filter,
qemu_domain) and (qemu_domain, filter) becomes interchangeable if all
code paths where filter AND qemu_domain are locked have a preceding
qemu_driver lock that basically blocks their concurrent execution.


So, the following code paths exist towards 
qemu_driver:qemudVMFilterRebuild where we now want to put a qemu_driver 
lock in front of the filter lock.


- nwfilterUndefine()   [ locks the filter ]
  - virNWFilterTestUnassignDef()
    - virNWFilterTriggerVMFilterRebuild()
      - qemudVMFilterRebuild()

- nwfilterDefine()
  - virNWFilterPoolAssignDef() [ locks the filter ]
    - virNWFilterTriggerVMFilterRebuild()
      - qemudVMFilterRebuild()

- nwfilterDriverReload()
  - virNWFilterPoolLoadAllConfigs()
    - virNWFilterPoolObjLoad()
      - virNWFilterPoolAssignDef() [ locks the filter ]
        - virNWFilterTriggerVMFilterRebuild()
          - qemudVMFilterRebuild()

- nwfilterDriverStartup()
  - virNWFilterPoolLoadAllConfigs()
    - virNWFilterPoolObjLoad()
      - virNWFilterPoolAssignDef() [ locks the filter ]
        - virNWFilterTriggerVMFilterRebuild()
          - qemudVMFilterRebuild()

Qemu is not the only driver using the nwfilter driver; the UML
driver also calls into it. Therefore qemudVMFilterRebuild() can be
exchanged with umlVMFilterRebuild(), along with the driver lock of
qemu_driver, which can now be a uml_driver. Further, since UML and
Qemu domains can be running on the same machine, the triggering of a
rebuild of the filter can touch both types of drivers and their
domains.


In the patch below I am now extending each nwfilter callback driver with
functions for locking and unlocking the (VM) driver (UML, QEMU) and
introducing new functions for locking all registered callback drivers
and unlocking them. Then I am distributing the
lock-all-cbdrivers/unlock-all-cbdrivers calls into the above call paths.
The last callpath shown, starting with nwfilterDriverStartup(), is
problematic since the nwfilter driver is initialized before the QEMU and
UML drivers are, and thus a lock in that path would be an attempt to
lock a NULL pointer -- the call to virNWFilterTriggerVMFilterRebuild()
is never made there, so we never lock either the qemu_driver or the
uml_driver in that path. Therefore, only the first 3 paths now receive
calls to lock and unlock all callback drivers. Now that the locks are
distributed where it matters, I can remove the qemu_driver and
uml_driver locks from qemudVMFilterRebuild() and umlVMFilterRebuild(),
no longer requiring the recursive locks.


For now I want to put this out as an RFC patch. I have tested it by 
'stretching' the critical section after the define/undefine functions 
each lock the filter so I can (easily) concurrently execute another VM 
operation (suspend,start). That code is in this patch and if you want 
you can de-activate it. It seems to work ok and operations are being 
blocked while the update is being done.
I still also want to verify the other assumption above that locking 
filter and qemu_domain always has a preceding qemu_driver lock.


Signed-off-by: Stefan Berger <stef...@us.ibm.com>

---
 src/conf/nwfilter_conf.c   |   18 ++
 src/conf/nwfilter_conf.h   |6 ++
 src/libvirt_private.syms   |2 ++
 src/nwfilter/nwfilter_driver.c |   26 ++
 src/qemu/qemu_driver.c |   19 +++
 src/uml/uml_driver.c   |   18 ++
 6 files changed, 81 insertions(+), 8 deletions(-)

Index: libvirt-acl/src/conf/nwfilter_conf.h
===
--- libvirt-acl.orig/src/conf/nwfilter_conf.h
+++ libvirt-acl/src/conf/nwfilter_conf.h
@@ -639,6 +639,8 @@ void virNWFilterConfLayerShutdown(void);

 typedef int