On Thursday 26 July 2018 23:23:13 Taylor R Campbell wrote:
> Is this a conceptual problem, or do you have a symptom that you're
> actually hitting with specific code?  If the latter, can you describe
> the symptom and quote the code?

Yes, this is a real problem I'm having.  This is my real "f()":

    struct urtwn_tx_data *
    urtwn_get_tx_data(struct urtwn_softc *sc, size_t pidx)
    {
            struct urtwn_tx_data *data = NULL;

            mutex_enter(&sc->sc_tx_mtx);
            if (!TAILQ_EMPTY(&sc->tx_free_list[pidx])) {
                    data = TAILQ_FIRST(&sc->tx_free_list[pidx]);
                    TAILQ_REMOVE(&sc->tx_free_list[pidx], data, next);
            }
            mutex_exit(&sc->sc_tx_mtx);

            return data;
    }

I'm getting a mutex error here saying that the lock is held.  Backtrace:

    System panicked: LOCKDEBUG: Mutex error: mutex_vector_enter,528: spin lock held
    Backtrace from time of crash is available.
    crash> bt
    _KERNEL_OPT_NARCNET() at 0
    _KERNEL_OPT_ACPI_SCANPCI() at _KERNEL_OPT_ACPI_SCANPCI
    vpanic() at vpanic+0x17d
    snprintf() at snprintf
    lockdebug_more() at lockdebug_more
    mutex_enter() at mutex_enter+0x6b6
    urtwn_get_tx_data() at urtwn_get_tx_data+0x22
    urtwn_raw_xmit() at urtwn_raw_xmit+0x3e
    ieee80211_raw_output() at ieee80211_raw_output+0x68
    ieee80211_send_probereq() at ieee80211_send_probereq+0x326
    scan_curchan() at scan_curchan+0x3c
    scan_start() at scan_start+0x2b0
    workqueue_worker() at workqueue_worker+0xe9

I see no evidence that scan_start() has been run twice, and I see no
other debug messages suggesting that urtwn_get_tx_data() is being
called again.  I can't snoop at crash time because my USB keyboard
quits working on a panic.  I've been using crash to snoop, but I'm not
that good at it yet.

An "ifconfig urtwn0 up" started the scan, but it appears that ifconfig
is no longer running.  I see only one lwp running, "net80211_wq".

This particular mutex gets taken from the USB softintr at the end of a
transmit.  So with ifconfig no longer running, it can't be that
ifconfig is calling a transmit function from the original thread and
then calling urtwn_get_tx_data().

During normal running, this mutex is taken in the transmit path
(urtwn_start(), the if_start function, and urtwn_raw_xmit(), which the
80211 layer uses in areas like the scan where, afaict, the frames are
management frames) and in urtwn_txeof(), which is the report back that
a transmit was done.  I'm assuming that the softintr and the workqueue
don't look like the same owner.

So I'm stuck wondering what is happening here.

Even though I don't see scan_start() called twice, I do need to
protect against that.  I'll see if that fixes the problem.

--Phil