On 04/02/2015 01:26 PM, Ingo Molnar wrote:
>
> * Linus Torvalds <torva...@linux-foundation.org> wrote:
>
>> So unless we find a real clear signature of the bug (I was hoping
>> that the ISR bit would be that sign), I don't think trying to bisect
>> it based on how quickly you can reproduce things is worthwhile.
>
> So I'm wondering (and I might have missed some earlier report that
> outlines just that), now that the possible location of the bug is
> again sadly up to 15+ million lines of code, I have no better idea
> than to debug by symptoms again: what kind of effort was made to
> examine the locked up state itself?
>
Ingo,

Rafael did some analysis while I was out earlier, here:
https://lkml.org/lkml/2015/2/23/234

My reproducer setup is as follows:

  L0 - 8-way CPU, 48 GB memory
  L1 - 2-way vCPU, 4 GB memory
  L2 - 1-way vCPU, 1 GB memory

Stress is only run in the L2 VM, and running top on L0/L1 doesn't show
excessive load.

> Softlockups always have some direct cause, which task exactly causes
> scheduling to stop altogether, why does it lock up - or is it not a
> clear lockup, just a very slow system?
>
> Thanks,
>
> Ingo
>

Whenever we look through the crashdump we see csd_lock_wait() spinning,
waiting for the CSD_FLAG_LOCK bit to be cleared. The signature leading
up to that usually looks like the following (in both the OpenStack
Tempest-on-OpenStack case and the nested-VM stress case):

  (qemu-system-x86 task)
  kvm_sched_in -> kvm_arch_vcpu_load -> vmx_vcpu_load ->
    loaded_vmcs_clear -> smp_call_function_single

  (ksmd task)
  pmdp_clear_flush -> flush_tlb_mm_range -> native_flush_tlb_others ->
    smp_call_function_many

--chris
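
P.S. Since the question was about the locked-up state itself, here is a
rough sketch of the two pieces of code the traces above end up in,
paraphrased from memory from kernel/smp.c and arch/x86/kvm/vmx.c of
roughly this vintage (not quoted verbatim, so take the exact details
with a grain of salt):

  /*
   * kernel/smp.c (approximate): a synchronous cross-CPU call spins
   * here until the target CPU has run the queued callback and cleared
   * CSD_FLAG_LOCK. If the IPI is lost or the callback never runs,
   * this loop never terminates and the soft lockup watchdog fires.
   */
  static void csd_lock_wait(struct call_single_data *csd)
  {
          while (csd->flags & CSD_FLAG_LOCK)
                  cpu_relax();
  }

  /*
   * arch/x86/kvm/vmx.c (approximate): the VMCS must be cleared on the
   * CPU it was last loaded on, so vmx_vcpu_load() ends up issuing a
   * synchronous (wait=1) cross-CPU call; this is where the
   * qemu-system-x86 task above gets stuck.
   */
  static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
  {
          int cpu = loaded_vmcs->cpu;

          if (cpu != -1)
                  smp_call_function_single(cpu, __loaded_vmcs_clear,
                                           loaded_vmcs, 1);
  }

The remote TLB flush from ksmd goes through smp_call_function_many()
and, as far as I can tell, waits on the CSD lock the same way, which
matches what we see in the crashdump.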