On Sun, Aug 14, 2016 at 09:42:11AM +0000, Prerna Saxena wrote:
> On 14/08/16 8:21 am, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> >On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
> >>
> >> On 12/08/16 12:08 pm, "Fam Zheng" <f...@redhat.com> wrote:
> >>
> >> >On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
> >> >> From: Prerna Saxena <prerna.sax...@nutanix.com>
> >> >>
> >> >> The set_mem_table command currently does not seek a reply. Hence, there
> >> >> is no easy way for a remote application to notify QEMU when it has
> >> >> finished setting up memory, or whether there were errors doing so.
> >> >>
> >> >> As an example:
> >> >> (1) QEMU sends a SET_MEM_TABLE to the backend (e.g., a vhost-user net
> >> >>     application). SET_MEM_TABLE does not require a reply according to
> >> >>     the spec.
> >> >> (2) QEMU commits the memory to the guest.
> >> >> (3) The guest issues an I/O operation over a new memory region which
> >> >>     was configured in (1).
> >> >> (4) The application has not yet remapped the memory, but it sees the
> >> >>     I/O request.
> >> >> (5) The application cannot satisfy the request because it does not
> >> >>     know about those GPAs.
> >> >>
> >> >> While a guaranteed fix would require a protocol extension (committed
> >> >> separately), a best-effort workaround for existing applications is to
> >> >> send a GET_FEATURES message before completing the
> >> >> vhost_user_set_mem_table() call. Since GET_FEATURES requires a reply,
> >> >> an application that processes vhost-user messages synchronously would
> >> >> probably have completed the SET_MEM_TABLE before replying.
> >> >>
> >> >> Signed-off-by: Prerna Saxena <prerna.sax...@nutanix.com>
> >> >> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
> >> >> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >> >
> >> >Sporadic hangs are seen with test-vhost-user after this patch:
> >> >
> >> >https://travis-ci.org/qemu/qemu/builds
> >> >
> >> >Reverting seems to fix it for me.
> >> >
> >> >Is this a known problem?
> >> >
> >> >Fam
> >>
> >> Hi Fam,
> >> Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on
> >> my CentOS 6 environment, so I missed this.
> >> I am setting up the docker test env to reproduce this, but I think I can
> >> guess the problem:
> >>
> >> In tests/vhost-user-test.c:
> >>
> >> static void chr_read(void *opaque, const uint8_t *buf, int size)
> >> {
> >>     ..[snip]..
> >>
> >>     case VHOST_USER_SET_MEM_TABLE:
> >>         /* received the mem table */
> >>         memcpy(&s->memory, &msg.payload.memory,
> >>                sizeof(msg.payload.memory));
> >>         s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds,
> >>                                             G_N_ELEMENTS(s->fds));
> >>
> >>         /* signal the test that it can continue */
> >>         g_cond_signal(&s->data_cond);
> >>         break;
> >>     ..[snip]..
> >> }
> >>
> >> The test seems to be marked complete as soon as the mem table is copied.
> >> However, patch 3/3 changes the behaviour of the SET_MEM_TABLE command
> >> implementation within QEMU. SET_MEM_TABLE now sends out a new
> >> GET_FEATURES message, and the call only completes once it receives
> >> features from the remote application (or the test framework, as is the
> >> case here).
> >
> >Hmm, but why does it matter that data_cond is woken up?
>
> Michael, sorry, I didn’t quite understand that. Could you please explain?
We do g_cond_signal, but I don't see where it prevents the test from
responding to GET_FEATURES. Except if the test exited signaling success -
but then why do we care?

> >
> >> While the test itself can be modified (do not signal completion until
> >> we’ve sent a follow-up response to GET_FEATURES), I am now wondering if
> >> this patch may break existing vhost applications too. If so, reverting
> >> it is possibly better.
> >
> >What bothers me is that the new feature might cause the same
> >issue once we enable it in the test.
>
> No it won’t. The new feature is a protocol extension, and only works if it
> has been negotiated. If not negotiated, that part of the code is never
> executed.

Absolutely - this reduces the risk - but how do we know that whatever
problem causes the test failures is not there with the new feature?

> >
> >How about a patch to tests/vhost-user-test.c adding the new
> >protocol feature? I would be quite interested to see what
> >is going on with it.
>
> Yes, that can be done. But you can see that the protocol extension patch
> will not change the behaviour of the _existing_ test.

If that passes on travis, then we'll be more confident - after all, it is
the travis failures that caused us to reconsider the work-around.

> >
> >> What confuses me is why it doesn’t fail all the time, but only about
> >> 20% to 30% of the time, as Fam reports.
> >
> >And succeeds every time on my systems :(
>
> +1 to that :( I have had no luck reproducing it.

> >>
> >> Thoughts: Michael, Fam, Marc-André?
> >>
> >> Regards,
>
> Prerna