On Sun, Aug 14, 2016 at 09:42:11AM +0000, Prerna Saxena wrote:
> On 14/08/16 8:21 am, "Michael S. Tsirkin" <m...@redhat.com> wrote:
> 
> 
> >On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
> >> 
> >> On 12/08/16 12:08 pm, "Fam Zheng" <f...@redhat.com> wrote:
> >> 
> >> 
> >> 
> >> 
> >> 
> >> >On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
> >> >> From: Prerna Saxena <prerna.sax...@nutanix.com>
> >> >> 
> >> >> The set_mem_table command currently does not seek a reply. Hence, there 
> >> >> is
> >> >> no easy way for a remote application to notify to QEMU when it finished
> >> >> setting up memory, or if there were errors doing so.
> >> >> 
> >> >> As an example:
> >> >> (1) Qemu sends a SET_MEM_TABLE to the backend (eg, a vhost-user net
> >> >> application). SET_MEM_TABLE does not require a reply according to the 
> >> >> spec.
> >> >> (2) Qemu commits the memory to the guest.
> >> >> (3) Guest issues an I/O operation over a new memory region which was 
> >> >> configured on (1).
> >> >> (4) The application has not yet remapped the memory, but it sees the 
> >> >> I/O request.
> >> >> (5) The application cannot satisfy the request because it does not know 
> >> >> about those GPAs.
> >> >> 
> >> >> While a guaranteed fix would require a protocol extension (committed 
> >> >> separately),
> >> >> a best-effort workaround for existing applications is to send a 
> >> >> GET_FEATURES
> >> >> message before completing the vhost_user_set_mem_table() call.
> >> >> Since GET_FEATURES requires a reply, an application that processes 
> >> >> vhost-user
> >> >> messages synchronously would probably have completed the SET_MEM_TABLE 
> >> >> before replying.
> >> >> 
> >> >> Signed-off-by: Prerna Saxena <prerna.sax...@nutanix.com>
> >> >> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
> >> >> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >> >
> >> >Sporadic hangs are seen with test-vhost-user after this patch:
> >> >
> >> >https://travis-ci.org/qemu/qemu/builds
> >> >
> >> >Reverting seems to fix it for me.
> >> >
> >> >Is this a known problem?
> >> >
> >> >Fam
> >> 
> >> Hi Fam,
> >> Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on 
> >> my Centos 6 environment, so missed this.
> >> I am setting up the docker test env to repro this, but I think I can guess 
> >> the problem :
> >> 
> >> In tests/vhost-user-test.c: 
> >> 
> >> static void chr_read(void *opaque, const uint8_t *buf, int size)
> >> {
> >> ..[snip]..
> >> 
> >>     case VHOST_USER_SET_MEM_TABLE:
> >>        /* received the mem table */
> >>        memcpy(&s->memory, &msg.payload.memory, sizeof(msg.payload.memory));
> >>        s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds, 
> >> G_N_ELEMENTS(s->fds));
> >> 
> >> 
> >>        /* signal the test that it can continue */
> >>        g_cond_signal(&s->data_cond);
> >>        break;
> >> ..[snip]..
> >> }
> >> 
> >> 
> >> The test seems to be marked complete as soon as mem_table is copied. 
> >> However, this patch 3/3 changes the behaviour of the SET_MEM_TABLE vhost 
> >> command implementation with qemu. SET_MEM_TABLE now sends out a new 
> >> message GET_FEATURES, and the call is only completed once it receives 
> >> features from the remote application. (or the test framework, as is the 
> >> case here.)
> >
> >Hmm but why does it matter that data_cond is woken up?
> 
> Michael, sorry, I didn’t quite understand that. Could you pls explain ?


We do g_cond_signal but I don't see where does it prevent
test from responding to GET_FEATURES. Except if test exited
signaling success - but then why do we care?


> >
> >
> >> While the test itself can be modified (Do not signal completion until 
> >> we’ve sent a follow-up response to GET_FEATURES), I am now wondering if 
> >> this patch may break existing vhost applications too ? If so, reverting it 
> >> possibly better.
> >
> >What bothers me is that the new feature might cause the same
> >issue once we enable it in the test.
> 
> No it wont. The new feature is a protocol extension, and only works if it has 
> been negotiated with. If not negotiated, that part of code is never executed.

Absolutely - this reduces the risk - but how do we know that whatever
problem causes the test failures is not there with the new feature?


> >
> >How about a patch to tests/vhost-user-test.c adding the new
> >protocol feature? I would be quite interested to see what
> >is going on with it.
> 
> Yes that can be done. But you can see that the protocol extension patch will 
> not change the behaviour of the _existing_ test.

If that passes on travis, then we'll be more confident -
after all it is the travis failures that cause us to
reconsider the work-around.

> >
> >
> >> What confuses me is why it doesn’t fail all the time, but only about 20% 
> >> to 30% time as Fam reports. 
> >
> >And succeeds every time on my systems :(
> 
> +1 to that :( I have had no luck repro’ing it.
> 
> >
> >> 
> >> Thoughts : Michael, Fam, MarcAndre ?
> >> 
> >> Regards,
> >>
> 
> Prerna

Reply via email to