* Peter Xu (pet...@redhat.com) wrote: > On Mon, Aug 28, 2017 at 12:11:38PM +0200, Marc-André Lureau wrote: > > Hi > > > > On Mon, Aug 28, 2017 at 5:05 AM, Peter Xu <pet...@redhat.com> wrote: > > > On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote: > > >> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert > > >> <dgilb...@redhat.com> > > >> wrote: > > >> > > >> > * Marc-André Lureau (marcandre.lur...@gmail.com) wrote: > > >> > > Hi > > >> > > > > >> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <pet...@redhat.com> wrote: > > >> > > > > >> > > > Firstly, introduce Monitor.use_thread, and set it for monitors > > >> > > > that are > > >> > > > using non-mux typed backend chardev. We only do this for > > >> > > > monitors, so > > >> > > > mux-typed chardevs are not suitable (when it connects to, e.g., > > >> > > > serials > > >> > > > and the monitor together). > > >> > > > > > >> > > > When use_thread is set, we create standalone thread to poll the > > >> > > > monitor > > >> > > > events, isolated from the main loop thread. Here we still need to > > >> > > > take > > >> > > > the BQL before dispatching the tasks since some of the monitor > > >> > > > commands > > >> > > > are not allowed to execute without the protection of BQL. Then > > >> > > > this > > >> > > > gives us the chance to avoid taking the BQL for some monitor > > >> > > > commands > > >> > in > > >> > > > the future. > > >> > > > > > >> > > > * Why this change? > > >> > > > > > >> > > > We need these per-monitor threads to make sure we can have at > > >> > > > least one > > >> > > > monitor that will never stuck (that can receive further monitor > > >> > > > commands). > > >> > > > > > >> > > > * So when will monitors stuck? And, how do they stuck? > > >> > > > > > >> > > > After we have postcopy and remote page faults, it's simple to > > >> > > > achieve a > > >> > > > stuck in the monitor (which is also a stuck in main loop thread): > > >> > > > > > >> > > > (1) Monitor deadlock on BQL > > >> > > > > > >> > > > As we may know, when postcopy is running on destination VM, the > > >> > > > vcpu > > >> > > > threads can stuck merely any time as long as it tries to access an > > >> > > > uncopied guest page. Meanwhile, when the stuck happens, it is > > >> > > > possible > > >> > > > that the vcpu thread is holding the BQL. If the page fault is not > > >> > > > handled quickly, you'll find that monitors stop working, which is > > >> > trying > > >> > > > to take the BQL. > > >> > > > > > >> > > > If the page fault cannot be handled correctly (one case is a paused > > >> > > > postcopy, when network is temporarily down), monitors will hang > > >> > > > forever. Without current patch, that means the main loop hanged. > > >> > We'll > > >> > > > never find a way to talk to VM again. > > >> > > > > > >> > > > > >> > > Could the BQL be pushed down to the monitor commands level instead? > > >> > > That > > >> > > way we wouldn't need a seperate thread to solve the hang on commands > > >> > > that > > >> > > do not need BQL. > > >> > > > >> > If the main thread is stuck though I don't see how that helps you; you > > >> > have to be able to run these commands on another thread. > > >> > > > >> > > >> Why would the main thread be stuck? In (1) If the vcpu thread takes the > > >> BQL > > >> and the command doesn't need it, it would work. In (2), info cpus > > >> shouldn't keep the BQL (my qapi-async series would probably help here) > > > > > > (Thanks for joining the discussion) > > > > > > AFAIK the main thread can be stuck for many reasons. I have seen one > > > stack when the VGA code (IIUC) was trying to writting to guest graphic > > > memory in main loop thread but luckily that guest page is still not > > > copied yet from source. As long as the main thread is stuck for any > > > reason, no chance for monitor commands, even if the commands support > > > async operations. > > > > If that command becomes async (it probably should, any command doing > > IO probaly should), then the main loop can keep running. > > The problem is that, it's not blocked at "a command", but a task > running on the main thread. The task can access guest memory, and > when the guest page is not there, the main thread hangs. Then it > hangs every monitors, and all other tasks that are bounded to main > thread.
This is my main reason for believing we need this approach; I don't see how the async-command solution solves it. (I don't have anything against the async command stuff, I just don't think it solves this problem) Dave > > > > > > > > So IMHO the only solution is doing these things in separate threads, > > > rather than all in a single one. > > > > I wouldn't say it's the only solution. I think the monitor can touch > > many areas that haven't been written with multi-threading in mind. My > > proposal is probably safer, although I don't know how hard it would be > > to push the BQL down to QMP commands, and make async existing IO > > commands. The benefits of this work are quite interesting imho, > > because a stuck mainloop is basically a stuck qemu, and an additional > > thread will not solve it... > > > > -- > > Marc-André Lureau > > -- > Peter Xu -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK