Hi Sergey,

Nice review of the implementations we have so far. Just a few comments
below.
On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> On 10/06/16 00:51, Sergey Fedorov wrote:
>> For certain kinds of tasks we might need a quiescent state to perform an
>> operation safely. Quiescent state means no CPU thread executing, and
>> probably BQL held as well. The tasks could include:
>> - Translation buffer flush (user and system-mode)
>> - Cross-CPU TLB flush (system-mode)
>> - Exclusive operation emulation (user-mode)
>>
>> If we use a single shared translation buffer which is not managed by RCU
>> and simply flushed when full, we'll need a quiescent state to flush it
>> safely.
>>
>> In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could
>> probably be made with async_run_on_cpu(). I suppose it is always the
>> guest system that needs to synchronise this operation properly. And as
>> soon as we request the target CPU to exit its execution loop for serving
>> the asynchronous work, we should probably be okay to continue execution
>> on the CPU that requested the operation while the target CPU executes
>> till the end of its current TB before it actually flushes its TLB.
>>
>> As for slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB
>> flushes (actually TLB flushes on all CPUs) must be done synchronously
>> and thus might require a quiescent state.
>>
>> Exclusive operation emulation in user-mode is currently implemented in
>> this manner, see start_exclusive(). It might change to some generic
>> mechanism of atomic/exclusive instruction emulation for system and
>> user-mode.
>>
>> It looks like we need to implement a common mechanism to perform safe
>> work in a quiescent state which could work in both system and user-mode,
>> at least for safe translation buffer flush in user-mode and MTTCG. I'm
>> going to implement such a mechanism. I would appreciate any suggestions,
>> comments and remarks.
>
> Considering different attempts to implement similar functionality, I've
> got the following summary.
>
> Fred's original async_run_safe_work_on_cpu() [1]:
> - resembles async_run_on_cpu();
> - introduces a per-CPU safe work queue, a per-CPU flag to prevent the
>   CPU from executing code, and a global counter of pending jobs;
> - implements rather complicated scheduling of jobs relying on both the
>   per-CPU flag and the global counter;
> - may not be entirely safe when draining work queues if multiple CPUs
>   have scheduled safe work;
> - does not support user-mode emulation.
>
> Alex's reiteration of Fred's approach [2]:
> - maintains a single global safe work queue;
> - uses GArray rather than a linked list to implement the work queue;
> - introduces a global counter of CPUs which have entered their execution
>   loop;
> - makes use of the last CPU to exit its execution loop to drain the safe
>   work queue;
> - still does not support user-mode emulation.
>
> Alvise's async_wait_run_on_cpu() [3]:
> - uses the same queue as async_run_on_cpu();
> - the CPU that requested the job is recorded in qemu_work_item;
> - each CPU has a counter of such jobs it has requested;
> - the counter is decremented upon job completion;
> - only the target CPU is forced to exit the execution loop, i.e. the job
>   is not run in quiescent state;

async_wait_run_on_cpu() kicks the target VCPU before calling cpu_exit() on
the current VCPU, so all the VCPUs are forced to exit. Moreover, the current
VCPU waits for all the tasks it has requested to be completed (see the rough
sketch after the quoted item below).

> - does not support user-mode emulation.
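To make this a bit more concrete, here is roughly how I read that flow.
This is not the actual patch, just a minimal sketch: the 'wcpu' and
'pending_wait_jobs' fields and the queue_work_on_cpu() helper are
illustrative names, and locking of the work queue is omitted.

/* Illustrative sketch only, not the real patch.  Assumes QEMU's internal
 * headers (CPUState, struct qemu_work_item, qemu_cpu_kick(), cpu_exit());
 * 'wcpu' and 'pending_wait_jobs' are invented names for this sketch. */
static void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu,
                                  void (*func)(void *data), void *data)
{
    struct qemu_work_item *wi = g_new0(struct qemu_work_item, 1);

    wi->func = func;
    wi->data = data;
    wi->wcpu = wcpu;                      /* record who asked for the job */

    atomic_inc(&wcpu->pending_wait_jobs); /* dropped when the job has run */
    queue_work_on_cpu(cpu, wi);           /* same queue as async_run_on_cpu() */

    qemu_cpu_kick(cpu);                   /* target leaves its TB loop ... */
    cpu_exit(wcpu);                       /* ... and so does the requester */
}

/* The requesting VCPU then stays out of TB execution until its
 * pending_wait_jobs counter drops back to zero, i.e. until every job it
 * queued this way has completed on the target VCPUs. */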
>
> Emilio's cpu_tcg_sched_work() [4]:
> - exploits tb_lock() to force CPUs to exit their execution loop;
> - requires 'tb_lock' to be held when scheduling a job;
> - allows each CPU to schedule only a single job;
> - handles scheduled work right in cpu_exec();
> - exploits synchronize_rcu() to wait for other CPUs to exit their
>   execution loop;
> - implements a complicated synchronization scheme;
> - should support both system and user-mode emulation.
>
>
> As for the requirements for a common safe work mechanism, each use case
> has its own considerations.
>
> Translation buffer flush just requires that no CPU is executing
> generated code during the operation.
>
> Cross-CPU TLB flush basically requires that no CPU is performing TLB
> lookup/modification. Some architectures might require the TLB flush to
> be complete before the requesting CPU can continue execution; others
> might allow delaying it until some "synchronization point". In the case
> of ARM, one such synchronization point is the DMB instruction. We might
> allow the operation to be performed asynchronously and continue
> execution, but we'd need to end the TB and synchronize on each DMB
> instruction. That doesn't seem very efficient. So a simple approach to
> force the operation to complete before executing anything else would
> probably make sense in both cases. Slow-path LL/SC emulation also
> requires the cross-CPU TLB flush to be complete before it can finish
> emulation of an LL instruction.
>
> Exclusive operation emulation in user-mode basically requires that no
> other CPU is executing generated code. However, I hope that both system
> and user-mode would use some common implementation of exclusive
> instruction emulation.
>
> It was pointed out that special care must be taken to avoid deadlocks
> [5, 6]. A simple and reliable approach might be to make all CPUs,
> including the requesting one, exit their execution loops and then serve
> all the pending requests.
>
> Distilling the requirements, a safe work mechanism should:
> - support both system and user-mode emulation;
> - allow scheduling an asynchronous operation to be performed outside of
>   the CPU execution loop;
> - guarantee that all CPUs are out of the execution loop before the
>   operation can begin;

This requirement is probably not necessary if we just need to request TLB
flushes on other VCPUs, since every VCPU will flush its own TLB. For this
reason we probably need two mechanisms:
- The first allows a VCPU to queue a job on all the others and wait for all
  of them to be done (like a global TLB flush);
- The second allows a VCPU to perform a task in a quiescent state, i.e. the
  task starts and finishes while all VCPUs are out of the execution loop
  (translation buffer flush).
A rough sketch of the two interfaces follows after the quoted text below.
Does this make sense?

> - guarantee that no CPU enters the execution loop before all the
>   scheduled operations are complete.

This is probably too much in some cases, for the reasons above.

Best regards,
alvise

>
> If that sounds like a sane approach, I'll come up with a more specific
> solution to discuss. The solution could be merged into v2.7 along with
> safe translation buffer flush in user-mode as an actual use case. Safe
> cross-CPU TLB flush would become a part of MTTCG work. Comments,
> suggestions, arguments etc. are welcome!
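As promised above, a purely illustrative sketch of the two mechanisms I have
in mind; the names and signatures are placeholders, not existing QEMU API:

/* Placeholder prototypes, only to illustrate the distinction between the
 * two mechanisms; neither name exists in QEMU today. */

/* Mechanism 1: queue @func on every other VCPU and block the calling VCPU
 * until all of them have run it.  Enough for a global TLB flush, where each
 * VCPU flushes its own TLB; no quiescent state is needed. */
void run_on_all_cpus_and_wait(CPUState *requesting_cpu,
                              void (*func)(void *data), void *data);

/* Mechanism 2: run @func in a quiescent state, i.e. it only starts once
 * every VCPU has left its execution loop and it completes before any of
 * them re-enters it.  This is what a translation buffer flush needs. */
void run_in_quiescent_state(CPUState *requesting_cpu,
                            void (*func)(void *data), void *data);

The first could stay close to the existing async_run_on_cpu() /
async_wait_run_on_cpu() machinery, while only the second really needs the
"everybody out of the execution loop" guarantee.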
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>
> Kind regards,
> Sergey