Re: Kernel panic in smp rendezvous

2017-05-11 Thread Konstantin Belousov
On Wed, May 10, 2017 at 03:29:39PM -0700, Olivier Cinquin wrote:
> Hi,
> I'm getting the following kernel panics on recent 11-STABLE (r317883):
> 
> spin lock 0x81df43d0 (smp rendezvous) held by 0xf8019c7a7000 (tid 100845) too long
> timeout stopping cpus
> panic: spin lock held too long
> cpuid = 12
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe3e64caf2a0
> vpanic() at vpanic+0x186/frame 0xfe3e64caf320
> panic() at panic+0x43/frame 0xfe3e64caf380
> _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x311/frame 0xfe3e64caf3f0
> smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x101/frame 0xfe3e64caf470
> smp_masked_invlpg_range() at smp_masked_invlpg_range+0x50/frame 0xfe3e64caf4a0
> pmap_invalidate_range() at pmap_invalidate_range+0x196/frame 0xfe3e64caf4e0
> vm_thread_stack_dispose() at vm_thread_stack_dispose+0x2f/frame 0xfe3e64caf530
> thread_free() at thread_free+0x39/frame 0xfe3e64caf550
> thread_reap() at thread_reap+0x10e/frame 0xfe3e64caf570
> proc_reap() at proc_reap+0x6bd/frame 0xfe3e64caf5b0
> proc_to_reap() at proc_to_reap+0x48c/frame 0xfe3e64caf600
> kern_wait6() at kern_wait6+0x49e/frame 0xfe3e64caf6b0
> sys_wait4() at sys_wait4+0x78/frame 0xfe3e64caf8a0
> amd64_syscall() at amd64_syscall+0x6c4/frame 0xfe3e64cafa30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfe3e64cafa30
> 
> 
> The panics occur sporadically and seemingly randomly. They occur on
> multiple machines with the same configuration, and on older revisions of
> 11-STABLE as well as r317883 (I cannot easily bisect this down to a
> specific commit given that it can take days or sometimes weeks for the
> panic to occur). Note that my kernel is compiled with "options IPSEC", but
> that does not seem to be relevant.
> 
> My understanding of the overall problem is that smp_rendezvous_cpus sends
> an IPI to all cores, but that some of the cores do not respond to it (or
> at least do not respond correctly and in time). More specifically, I can
> find 61 threads that appear to be blocked in cpustop_handler on the atomic
> operation that marks them as stopped in response to the IPI, namely
> CPU_SET_ATOMIC(cpu, &stopped_cpus). CPU_SET_ATOMIC boils down to a call to
> "atomic_set_long" (I'm not sure where that is itself defined for amd64).
> Either there is a line numbering problem, or this suggests a deadlock at
> this step (unless there is some sort of livelock and CPU_SET_ATOMIC is
> simply where the threads spend most of their time).
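> 
> For what it's worth, my reading of sys/cpuset.h and sys/bitset.h (so
> treat this as a sketch, not gospel) is that CPU_SET_ATOMIC is just an
> atomic "or" of one bit into the mask; the stand-in struct and function
> names below are made up for illustration:
> 
> #include <sys/types.h>
> #include <machine/atomic.h>
> 
> #define SKETCH_BITS_PER_WORD	(sizeof(u_long) * 8)
> 
> /* Stand-in for cpuset_t on a machine with at most 64 CPUs. */
> struct cpuset_sketch {
> 	u_long bits[1];
> };
> 
> /*
>  * Roughly what CPU_SET_ATOMIC(cpu, p) seems to boil down to:
>  * atomic_set_long() on the word holding this CPU's bit.  On amd64 that
>  * appears to be a "lock or" from machine/atomic.h, which cannot block,
>  * so a thread "stuck" at that line is presumably really spinning in
>  * the while loop just after it.
>  */
> static void
> cpu_set_atomic_sketch(int cpu, struct cpuset_sketch *p)
> {
> 	atomic_set_long(&p->bits[cpu / SKETCH_BITS_PER_WORD],
> 	    1UL << (cpu % SKETCH_BITS_PER_WORD));
> }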
> 
> Looking at the thread backtraces as a whole, I can see the one that
> triggered the panic, the 61 that are in cpustop_handler, and a lot of
> sleeping threads. Each core should have an active thread, and this is an
> architecture with 64 cores, so that leaves 64 - (1 + 61) = 2 cores
> unaccounted for. Interestingly, these numbers are the same for the 2
> different panics for which I have a vmcore file. For those two panics,
> the IDs of the cores that are neither in cpustop_handler nor calling the
> smp rendezvous are not the same (#18 and #19 in one instance, #44 and #45
> in the other), but in both instances the IDs are consecutive; perhaps
> that's relevant.
> 
> I've uploaded a full set of backtraces at
> https://gist.github.com/cinquin/d63cdf9de01b0b8033c47128868f2d38 and a
> dmesg at https://gist.github.com/cinquin/17c0cf6ac7832fd1b683488d08d3e69b
> (in short, the machine is a 4-processor Opteron 6274 system). I'm pasting
> below an example backtrace of the threads in cpustop_handler.
> 
> Any suggestions as to what the problem might be and how to solve it? I've
> saved the core and can provide further information.
> 
> Thanks
> Olivier Cinquin
> 
> A total of 61 threads have backtraces along these lines:
> 
> #0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275
> #1  0x8105a6f5 in ipi_nmi_handler () at /usr/src/sys/x86/x86/mp_x86.c:1231
> #2  0x80ef740a in trap (frame=0xfe3e687f7f30) at /usr/src/sys/amd64/amd64/trap.c:185
> #3  0x80edbd03 in nmi_calltrap () at /usr/src/sys/amd64/amd64/exception.S:510
> ...
> (kgdb) frame 0
> #0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275
> 1275  CPU_SET_ATOMIC(cpu, &stopped_cpus);
> (kgdb) list
> 1270  cpu = PCPU_GET(cpuid);
> 1271  
> 1272  savectx(&stoppcbs[cpu]);
> 1273  
> 1274  /* Indicate that we are stopped */
> 1275  CPU_SET_ATOMIC(cpu, &stopped_cpus);
> 1276  
> 1277  /* Wait for restart */
> 1278  while (!CPU_ISSET(cpu, &started_cpus))
> 1279  ia32_pause();
> (kgdb) p stopped_cpus
> $1 = {__bits = 0x81df4368}
> (kgdb) p started_cpus
> $2 = {__bits = 0x81df4418}
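> 
> ($1 and $2 above look like kgdb printing the address of each __bits
> array rather than its contents; if so, something like
> 
> (kgdb) p/x stopped_cpus.__bits[0]
> (kgdb) x/gx 0x81df4368
> 
> should show the actual masks, modulo kgdb syntax.)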
> 

It would be interesting to see the backtraces only for the threads that
were on CPUs at the time of the panic, i.e. all the threads parked in
cpustop_handler() and whatever the remaining CPUs were running.
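
Something along the following lines in kgdb should find the threads that
the two unaccounted-for CPUs were running (commands from memory, so
adjust to your kgdb; if it has no 'tid' command, map the tid through
'info threads' and 'thread' instead):

(kgdb) p cpuid_to_pcpu[18]->pc_curthread->td_tid
(kgdb) tid <the tid printed above>
(kgdb) bt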

From your d

Re: FreeBSD Vagrant Box Kernel Panic

2017-05-11 Thread Patrick M. Hausen
Hello,

> On 2017-05-10 at 21:24, Robert Simmons wrote:
> It appears that the maintainer of the boxes should create a
> 11.0-RELEASE-p10 box and revoke the -p1 box.

The official Hashicorp boxes have been invoking freebsd-update on startup
for a while now, so every box automatically updates itself to the latest
patch level.

We have started to roll our own boxes, so I'm not perfectly up to date.
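
For our own boxes, the equivalent first-boot step is roughly the following
(a sketch of the idea, not necessarily the exact invocation the official
boxes use), plus a reboot if the kernel was patched:

freebsd-update fetch --not-running-from-cron
freebsd-update install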

Patrick

