On 22.4.2021. 11:02, Alexander Bluhm wrote:
> On Thu, Apr 22, 2021 at 09:03:22AM +0200, Hrvoje Popovski wrote:
>> something like this:
>>
>> x3550m4# pappnaiannc:iicc :p:o ppoolo_oolcla__ddcohoe__gg_eiettt::e m
>> _mmcbmualg2fkpilc2_:: chppeaag
>> gceke: ee mmmbppttuyfyp
>
> This was without my kernel lock around ARP bandage, right?

yes, yes ...

>
>> ddb{9}> mach ddbcpu 0xa
>> Stopped at x86_ipi_db+0x12: leave
>> x86_ipi_db(ffff800021a2aff0) at x86_ipi_db+0x12
>> x86_ipi_handler() at x86_ipi_handler+0x80
>> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
>> pool_get(ffffffff8221e568,2) at pool_get+0x43
>> m_gethdr(2,1) at m_gethdr+0x3f
>> rtm_msg1(e,ffff800026e3cf70) at rtm_msg1+0x4c
>> rtm_ifchg(ffff8000005b3800) at rtm_ifchg+0x61
>> if_down(ffff8000005b3800) at if_down+0xa0
>> if_downall() at if_downall+0x5b
>> boot(104) at boot+0x99
>> reboot(104) at reboot+0x5b
>> panic(ffffffff81df855b) at panic+0x132
>> pool_do_get(ffffffff8221ebc8,2,ffff800026e3d294) at pool_do_get+0x309
>> pool_get(ffffffff8221ebc8,2) at pool_get+0x95
>> end trace frame: 0xffff800026e3d340, count: 0
>>
>> ddb{10}> mach ddbcpu 0xb
>> Stopped at db_enter+0x10: popq %rbp
>> db_enter() at db_enter+0x10
>> panic(ffffffff81df855b) at panic+0x12a
>> pool_do_get(ffffffff8221e568,2,ffff800026e43294) at pool_do_get+0x309
>> pool_get(ffffffff8221e568,2) at pool_get+0x95
>> m_clget(0,2,802) at m_clget+0xdd
>> ixgbe_get_buf(ffff80000015c0e8,e) at ixgbe_get_buf+0xa3
>> ixgbe_rxfill(ffff80000015c0e8) at ixgbe_rxfill+0xae
>> ixgbe_queue_intr(ffff80000011ac40) at ixgbe_queue_intr+0x4f
>> intr_handler(ffff800026e434b0,ffff8000000cd700) at intr_handler+0x6e
>> Xintr_ioapic_edge4_untramp() at Xintr_ioapic_edge4_untramp+0x18f
>> acpicpu_idle() at acpicpu_idle+0x1ea
>> sched_idle(ffff800021a33ff0) at sched_idle+0x27e
>> end trace frame: 0x0, count: 3
>
> Two processors 10 and 11 in pool get.
>
> CPU 10 does pool_get, panic, boot, pool_get again.
> CPU 11 was the one that originally stopped in ddb.
>
> Did you enter boot reboot before doing mach ddbcpu 0xa?

nope... is doing that ever useful?

> Or how did we get the boot sequence in this trace?
>
> Can it be that both CPUs panicked simultaneously? The mangled message
> indicates this. Then cpu 10 saw that cpu 11 already panicked to ddb
> and tried to reboot. There it panicked again and got stuck in a
> recursive call to pool_get().
>
> The if (db_panic) in the panic() function was not written with
> simultaneous panics on multiple CPUs in mind.

if you want i'll try to reproduce it on other boxes.. maybe i can
trigger it here easily because of the 2 sockets?
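
To make the simultaneous-panic scenario above concrete, here is a minimal user-space sketch in C11 of one way a panic path could tell "first CPU to panic" apart from "another CPU is already panicking". It is not the OpenBSD panic() code and not a proposed diff: panic_owner, panic_claim() and enum panic_role are hypothetical names made up for illustration, and CPU ids are assumed to be non-negative integers.

```c
#include <assert.h>
#include <stdatomic.h>

#define NO_CPU (-1)

/* Hypothetical: id of the CPU that owns the panic, NO_CPU if none yet. */
static atomic_int panic_owner = NO_CPU;

enum panic_role {
	PANIC_FIRST,		/* this CPU panicked first, may enter ddb */
	PANIC_RECURSIVE,	/* same CPU panicked again */
	PANIC_OTHER_CPU		/* a different CPU already owns the panic */
};

/* Decide what a CPU that has just entered the panic path should do. */
static enum panic_role
panic_claim(int cpu)
{
	int expected = NO_CPU;

	assert(cpu != NO_CPU);

	/* Only one CPU can swap NO_CPU for its own id. */
	if (atomic_compare_exchange_strong(&panic_owner, &expected, cpu))
		return PANIC_FIRST;

	/* On failure, expected now holds the id of the current owner. */
	return (expected == cpu) ? PANIC_RECURSIVE : PANIC_OTHER_CPU;
}
```

With the CPU numbers from the trace, panic_claim(11) would return PANIC_FIRST and a later panic_claim(10) would return PANIC_OTHER_CPU, so under this sketch the second CPU could stop or spin instead of going on to reboot. Whether that is the right policy for the real kernel is a separate question.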