2010/12/16 Michael S. Tsirkin <m...@redhat.com>:
> On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/3 Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>:
>> > 2010/12/2 Michael S. Tsirkin <m...@redhat.com>:
>> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >>> 2010/11/28 Michael S. Tsirkin <m...@redhat.com>:
>> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >>> >> 2010/11/28 Michael S. Tsirkin <m...@redhat.com>:
>> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >>> >> >> Modify the inuse type to uint16_t, let save/load handle it, and revert
>> >>> >> >> last_avail_idx by inuse if there is outstanding emulation.
>> >>> >> >>
>> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
>> >>> >> >
>> >>> >> > This changes the migration format, so it will break compatibility with
>> >>> >> > existing drivers.  More generally, I think migrating internal
>> >>> >> > state that is not guest visible is always a mistake,
>> >>> >> > as it ties the migration format to an internal implementation
>> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >>> >> > try not to add such cases).  I think the right thing to do in this
>> >>> >> > case is to flush outstanding work when the VM is stopped.  Then, we
>> >>> >> > are guaranteed that inuse is 0.
>> >>> >> > I sent patches that do this for virtio net and block.
>> >>> >>
>> >>> >> Could you give me the link of your patches?  I'd like to test
>> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >>> >> happy to drop this patch.
>> >>> >>
>> >>> >> Yoshi
>> >>> >
>> >>> > Look for this:
>> >>> > stable migration image on a stopped vm
>> >>> > sent on:
>> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >>>
>> >>> Thanks for the info.
>> >>>
>> >>> However, the patch series above didn't solve the issue.  In the
>> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >>> output, and while last_avail_idx gets incremented
>> >>> immediately, not sending inuse makes the state inconsistent
>> >>> between the Primary and the Secondary.
>> >>
>> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >
>> > I think we can calculate or prepare an internal last_avail_idx,
>> > and update the external one when inuse is decremented.  I'll try
>> > whether it works with and without Kemari.
>>
>> Hi Michael,
>>
>> Could you please take a look at the following patch?
>
> Which version is this against?

Oops.  It's against a rather old commit:
67f895bfe69f323b427b284430b6219c8a62e8d4

>> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> Author: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
>> Date:   Thu Dec 16 14:50:54 2010 +0900
>>
>>     virtio: update last_avail_idx when inuse is decreased.
>>
>>     Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
>
> It would be better to have a commit description explaining why a change
> is made, and why it is correct, not just repeating what can be seen from
> the diff anyway.

Sorry for being lazy here.

>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index c8a0fc6..6688c02 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>>      wmb();
>>      trace_virtqueue_flush(vq, count);
>>      vring_used_idx_increment(vq, count);
>> +    vq->last_avail_idx += count;
>>      vq->inuse -= count;
>>  }
>>
>> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>      unsigned int i, head, max;
>>      target_phys_addr_t desc_pa = vq->vring.desc;
>>
>> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>>          return 0;
>>
>>      /* When we start there are none of either input nor output. */
>> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>
>>      max = vq->vring.num;
>>
>> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>
>>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>
>
> Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?

I think there are two problems.

1. When to update last_avail_idx.
2. The ordering issue you're mentioning below.

The patch above is only trying to address 1, because last time you
mentioned that modifying last_avail_idx upon save may break the
guest, which I agree with.  If virtio_queue_empty and
virtqueue_avail_bytes are only used internally, meaning they are
invisible to the guest, I guess the same approach can be applied to
them too.
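
To make 1 concrete, this is the behaviour I'm expecting from the
patch above (a hand trace only, assuming the existing vq->inuse++ at
the end of virtqueue_pop is kept as is):

    start              last_avail_idx = 0, inuse = 0
    pop A              head read at 0 + 0, inuse = 1
    pop B              head read at 0 + 1, inuse = 2
    complete/flush A   last_avail_idx = 1, inuse = 1
    complete/flush B   last_avail_idx = 2, inuse = 0

So last_avail_idx only advances when a request is actually flushed,
and a save taken in the middle would send 1 instead of 2, letting the
secondary redo B.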

> The previous patch version sure looked simpler, and this seems functionally
> equivalent, so my question still stands; here it is, rephrased in a
> different way:
>
>        assume that we have 2 requests in the avail ring at the start: A and B,
>        in this order
>
>        the host pops A, then B, then completes B and flushes
>
>        now with this patch last_avail_idx will be 1; the remote
>        will get it and will execute B again.  As a result,
>        B will complete twice, and apparently A will never complete.
>
>
> This is what I was saying below: assuming that there are
> outstanding requests when we migrate, there is no way
> a single index can be enough to figure out which requests
> need to be handled and which are in flight already.
>
> We must add some kind of bitmask to tell us which is which.
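
Just to check that I understand the bitmask you're describing, is it
roughly something like this?  (an illustration on my side only, not
existing code; I'm assuming one bit per avail ring entry, indexed by
the entry's position modulo vring.num)

    /* one bit per avail ring entry: set while the entry has been
     * made available by the guest but not yet completed by us */
    uint8_t pending[512 / 8];   /* 64 bytes for a 512-entry ring */

    /* when the request at ring position idx becomes pending */
    pending[idx / 8] |= 1 << (idx % 8);
    /* when it completes */
    pending[idx / 8] &= ~(1 << (idx % 8));

On save we would send this bitmap as-is, and on load the remote would
resubmit exactly the entries whose bit is still set, possibly out of
order, which I guess is the non-trivial part you mentioned before.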

I should understand why this inversion can happen before solving
the issue.  Currently, how are you making virtio-net flush every
request for live migration?  Is it qemu_aio_flush()?

Yoshi

>
>> >
>> >>
>> >>>  I'm wondering why
>> >>> last_avail_idx is OK to send but not inuse.
>> >>
>> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> our internal implementation, but it does *also* express
>> >> a guest observable state.
>> >>
>> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> there is no way to detect that a specific request has already been
>> >> completed. And the protocol forbids completing the same request twice.
>> >>
>> >> Our implementation always starts processing the requests
>> >> in order, and since we flush outstanding requests
>> >> before save, it works to just tell the remote 'process only requests
>> >> after this place'.
>> >>
>> >> But there's no such requirement in the virtio protocol,
>> >> so to be really generic we could add a bitmask of valid avail
>> >> ring entries that did not complete yet. This would be
>> >> the exact representation of the guest observable state.
>> >> In practice we have rings of up to 512 entries.
>> >> That's 64 bytes per ring, not a lot at all.
>> >>
>> >> However, if we ever do change the protocol to send the bitmask,
>> >> we would need some code to resubmit requests
>> >> out of order, so it's not trivial.
>> >>
>> >> Another minor mistake with last_avail_idx is that it has
>> >> some redundancy: the high bits in the index
>> >> (> vq size) are not necessary, as they can be
>> >> derived from the avail idx.  There's a consistency check
>> >> in load but we really should try to use formats
>> >> that are always consistent.
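
(A small worked example of this redundancy, if I'm reading it right:
with a 256-entry ring and a guest-visible avail idx of 0x1234, only
the low 8 bits of last_avail_idx really matter.  If those low bits
are 0x30, then since avail idx - last_avail_idx can never exceed the
ring size, last_avail_idx can only be 0x1230; the high byte is
implied by the avail idx.)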
>> >>
>> >>> The following patch does the same thing as the original, yet
>> >>> keeps the virtio migration format.  It shouldn't break live
>> >>> migration either, because inuse should be 0.
>> >>>
>> >>> Yoshi
>> >>
>> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> And if not, how do you handle the fact that some requests
>> >> are in flight on the primary?
>> >
>> > Although we try flushing requests one by one to make inuse 0,
>> > there are cases where it fails over to the secondary while inuse
>> > isn't 0.  We handle these in-flight requests on the primary by
>> > replaying them on the secondary.
>> >
>> >>
>> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> index c8a0fc6..875c7ca 100644
>> >>> --- a/hw/virtio.c
>> >>> +++ b/hw/virtio.c
>> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>>      qemu_put_be32(f, i);
>> >>>
>> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >>> +        uint16_t last_avail_idx;
>> >>> +
>> >>>          if (vdev->vq[i].vring.num == 0)
>> >>>              break;
>> >>>
>> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >>> +
>> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >>>          if (vdev->binding->save_queue)
>> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>>      }
>> >>>
>> >>>
>> >>
>> >> This looks wrong to me.  Requests can complete in any order, can they
>> >> not?  So if request 0 did not complete and request 1 did,
>> >> you send avail - inuse, and on the secondary you will process and
>> >> complete request 1 a second time, crashing the guest.
>> >
>> > In the case of Kemari, no.  We sit between the devices and net/block, and
>> > queue the requests.  After completing each transaction, we flush
>> > the requests one by one.  So there won't be completion inversion,
>> > and therefore it won't be visible to the guest.
>> >
>> > Yoshi
>> >
>> >>
>> >>>
>> >>> >
>> >>> >> >
>> >>> >> >> ---
>> >>> >> >>  hw/virtio.c |    8 +++++++-
>> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >>> >> >>
>> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> >> >> index 849a60f..5509644 100644
>> >>> >> >> --- a/hw/virtio.c
>> >>> >> >> +++ b/hw/virtio.c
>> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >>> >> >>      VRing vring;
>> >>> >> >>      target_phys_addr_t pa;
>> >>> >> >>      uint16_t last_avail_idx;
>> >>> >> >> -    int inuse;
>> >>> >> >> +    uint16_t inuse;
>> >>> >> >>      uint16_t vector;
>> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >>> >> >>      VirtIODevice *vdev;
>> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >>          if (vdev->binding->save_queue)
>> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>> >> >>      }
>> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >> +
>> >>> >> >> +        /* revert last_avail_idx if there is outstanding emulation. */
>> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >>> >> >>
>> >>> >> >>          if (vdev->vq[i].pa) {
>> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >>> >> >> --
>> >>> >> >> 1.7.1.2
>> >>> >> >>