Pedro, On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote: > On Wed, Dec 14, 2022 at 11:37:14AM +0000, Pedro Caetano wrote: > > Hi bugs@ > > > > In the process of upgrading a pair of servers to release 7.2, the following > > panic was triggered after sysupgrade reboot. (dell poweredge R740) > > > > One of the reboots happened before syspatch, the other happened after > > applying the release patches. > > > > After powercycling, both servers managed to boot successfully. > > > > Please keep me copied as I'm not subscribed to bugs@ > > > > > > Screenshot of the panic uploaded attached to this email. > > For reference: > > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line > 16-way L2 cache, 8MB 64b/line 11-way L3 cache > cpu2: smt 0, core 5, package 0 > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1 > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: > apid 26 (application process > > Somehow your machine is violating one of the TSC sync test sanity > checks. The idea behind this one is that there should only be one AP > in the sync test at a time. > > At the start of each test, in tsc_test_sync_ap(), the AP sets > tsc_ap_name to its dv_xname. It does this with an atomic CAS > expecting NULL to ensure no other AP is still running the sync test. > You're hitting this panic: > > 449 void > 450 tsc_test_sync_ap(struct cpu_info *ci) > 451 { > 452 if (!tsc_is_invariant) > 453 return; > 454 #ifndef TSC_DEBUG > 455 if (!tsc_is_synchronized) > 456 return; > 457 #endif > 458 /* The BP needs our name in order to report any problems. */ > 459 if (atomic_cas_ptr(&tsc_ap_name, NULL, ci->ci_dev->dv_xname) > != NULL) { > 460 panic("%s: %s: tsc_ap_name is not NULL: %s", > 461 __func__, ci->ci_dev->dv_xname, tsc_ap_name); > 462 } > > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of > every sync test, from tsc_test_sync_bp(): > > 415 /* > 416 * Report what happened. Adjust the TSC's quality > 417 * if this is the first time we've failed the test. > 418 */ > 419 tsc_report_test_results(); > 420 if (tsc_ap_status.lag_count || > tsc_bp_status.lag_count) { > 421 if (tsc_is_synchronized) { > 422 tsc_is_synchronized = 0; > 423 tc_reset_quality(&tsc_timecounter, > -1000); > 424 } > 425 tsc_test_rounds = 0; > 426 } else > 427 tsc_test_rounds--; > 428 > 429 /* > 430 * Clean up for the next round. It is safe to reset > the > 431 * ingress barrier because at this point we know the > AP > 432 * has reached the egress barrier. > 433 */ > 434 memset(&tsc_ap_status, 0, sizeof tsc_ap_status); > 435 memset(&tsc_bp_status, 0, sizeof tsc_bp_status); > 436 tsc_ingress_barrier = 0; > 437 if (tsc_test_rounds == 0) > 438 tsc_ap_name = NULL; > > It's possible the BP's store: > > tsc_ap_name = NULL; > > is not *always* globally visible by the time the next AP reaches the > tsc_ap_name CAS, triggering the panic. If so, we could force the > store to complete with membar_producer(). tsc_ap_name should be > volatile, too. > > OTOH, it's possible this particular check is not the right thing here. > My intention is correct... we definitely don't want more than one AP > in the sync test at any given moment. But this tsc_ap_name handshake > thing may be the wrong way to assert that.
Just following up on this. Have you seen the panic you reported again?