On 09/18/2014 07:56 PM, Paolo Bonzini wrote:
> On 18/09/2014 05:26, Alexey Kardashevskiy wrote:
>> On 09/18/2014 01:07 AM, Stefan Hajnoczi wrote:
>>> On Wed, Sep 17, 2014 at 2:44 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>> On 09/17/2014 07:25 PM, Paolo Bonzini wrote:
>>>> btw any better idea of a hack to try? Testers are pushing me - they want
>>>> to upgrade the broken setup and I am blocking them :) Thanks!
>>>
>>> Paolo's qemu_co_mutex_lock(&s->lock) idea in qcow2_invalidate_cache()
>>> is good. Have you tried that patch?
>>
>> Yes, I tried it; it did not help.
>>
>>> I haven't checked the qcow2 code whether that works properly across
>>> bdrv_close() (is the lock freed?) but in principle that's how you
>>> protect against concurrent I/O.
>>
>> I thought we had to avoid qemu_coroutine_yield() in this particular case.
>> I fail to see how the lock helps if we still yield. But the whole thing
>> is already way beyond my understanding :) For example - how many
>> BlockDriverState things are layered here? NBD -> QCOW2 -> RAW?
>
> No, this is an NBD server. So we have three users of the same QCOW2
> image: migration, the NBD server and the virtio disk (not active while
> the bug happens, and thus not depicted):
>
>   NBD server -> QCOW2 <- migration
>                   |
>                   v
>                 File
>
> The problem is that the NBD server accesses the QCOW2 image while
> migration does qcow2_invalidate_cache.
Ufff. Cool. Anyway, the qemu_co_mutex_lock(&s->lock) hack does not work:
after qcow2_close() the lock is cleared, so qemu_co_mutex_unlock(&s->lock)
fails. Moving the lock to BlockDriverState caused weird side effects; still
debugging...


--
Alexey
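
For context, a minimal sketch of the hack being discussed, assuming the
QEMU 2.1-era shape of qcow2_invalidate_cache() in block/qcow2.c (the reopen
path is elided, and the exact patch tried above may differ). It illustrates
why the final unlock fails: the function zeroes the BDRVQcowState between
close and reopen, which clobbers the very CoMutex that was just taken.

static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
{
    BDRVQcowState *s = bs->opaque;   /* s->lock serializes qcow2 requests   */

    /* Hack under discussion: hold the qcow2 lock so NBD server coroutines
     * cannot issue requests while the image is closed and reopened.
     */
    qemu_co_mutex_lock(&s->lock);

    qcow2_close(bs);                     /* tear down the qcow2 state...     */
    memset(s, 0, sizeof(BDRVQcowState)); /* ...and wipe s->lock with it      */

    /* ... reopen via qcow2_open(), which re-initializes s including s->lock,
     * details omitted ...
     */

    qemu_co_mutex_unlock(&s->lock);  /* fails: the mutex locked above is no
                                      * longer marked as held                */
}

That matches the failure described above: the critical section only lives as
long as the BDRVQcowState does, which is why moving the CoMutex up into
BlockDriverState (whose lifetime spans the close/reopen cycle) looked like
the next thing to try.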