[I forgot to add the cover letter to git send-email; here it is]

v3: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg04179.html
"Why is this an RFC?" See the v3 link above. Also, see the comment at the
bottom of this message regarding the last patch of this series.

Changes since v3:

- Add R-b's -- rth: thanks for all the reviews!

- Partially bring back the approach from the v1 series, that is, do not
  acquire the CPU mutex in cpu_interrupt_request; use atomic_read instead.
  The lock is not needed because of the OR+kick sequence; note that when
  we do not have the BQL (i.e. in MTTCG's CPU loop), we do have the lock
  while reading cpu_interrupt_request, so no writes can be missed there.
  This simplifies the calling code quite a bit, since we no longer hold
  cpu_mutex across large portions of code (that would be a very bad idea;
  for example, it could lead to deadlock if we tried to queue work on
  another CPU). Setters still acquire the lock, since its cost is very
  similar to that of a locked atomic, and it makes the code simpler.

- Use an intermediate interrupt_request variable wherever this is trivial
  to do. Whenever there is a chance that cpu->interrupt_request might have
  been updated between reads, use cpu_interrupt_request to perform a load
  via atomic_read.

- Drop the BQL assert in run_on_cpu. Keep the guarantee that the queued
  work holds the BQL, though.

- Add a no_cpu_mutex_lock_held assert to run_on_cpu; note that CPU locks
  can only be acquired in CPU_FOREACH order, otherwise we might deadlock.

- Remove qemu_cpu_cond leftovers; it is superseded by cpu->cond.

- Export no_cpu_mutex_lock_held.

- Do not export process_queued_work_locked from cpus-common.c; at some
  point we need to drop cpu_mutex for other vCPUs to queue work on the
  current vCPU, and process_queued_work is a good place to do that. Add a
  comment about this.

- Add helper_cpu_halted_set to tcg-runtime. Convert the TCG targets that
  were setting cpu->halted directly.

- Fix cpu_reset_interrupt in cpu_common_post_load, as reported by Richard.

- Fix a hang after making qemu_work_cond per-CPU. The hang can only happen
  between that commit and the commit that completes the transition to
  per-CPU locks. This would break bisection, so fix it. In addition, add a
  comment about it, and add a tiny patch to undo the fix once the
  transition is complete and the fix is no longer needed.

- Fix a possible deadlock (acquiring the BQL *after* having acquired the
  CPU mutex) in cpu_reset_interrupt. This would break bisection until the
  transition to per-CPU locks, so fix it and add a comment about it to the
  commit log of the "cpu: define cpu_interrupt_request helpers" patch.

- Add cc->has_work_with_iothread_lock, as suggested by Paolo, and convert
  the targets that need it.

- Add a patch to reduce contention when doing exclusive work. In my tests
  only aarch64 benefits (very minimally) from this; for other targets the
  patch is perf-neutral. On aarch64 the problem is that frequent global
  TLB invalidations require frequent exclusive work; sync-profile points
  to unnecessary contention on cpu_list_lock, which all waiters wait on
  while the exclusive work completes. To fix this, make the waiters wait
  on their own CPU lock, so that their wakeup is uncontended. The perf
  impact is that the scalability collapse due to exclusive work is
  mitigated, but not fixed. I'm including this patch to raise awareness of
  the issue, but I don't feel strongly about merging it.

The series is checkpatch-clean (with a single warning about __COVERITY__).
You can fetch it from:

  https://github.com/cota/qemu/tree/cpu-lock-v4

Thanks,

		Emilio