While testing my RA patches, I've seen iwn "hang": the AP and the iwn client were still exchanging packets at the wifi layer, but the upper layers (IP, UDP, TCP, etc.) were stuck. I could easily trigger this by running tcpbench and moving towards the edge of the range of my AP. Traffic would recover by itself after a while.

I can now explain why this is happening and suggest a fix.

When a packet on an aggregation queue fails, the driver sends a block ack request to the receiver. This request contains the current starting sequence number (SSN) of the firmware's block ack window. The purpose of this request is to let the receiver know that any frames lower than the SSN should be discarded. Only frames within [SSN, SSN + window size) are valid. In other words, we are trying to "resync" our block ack Tx window with the peer after a Tx failure.

Most of the time, this works fine. We send one block ack request with SSN X and the receiver sends a block ack, having adjusted its receive window such that X becomes its new lower bound. For example:

  iwn: block ack request SSN=2551
  AP:  block ack SSN=2551

Sometimes, however, the firmware (not the driver) sends another block ack request immediately after the AP's block ack is received, and such block ack requests contain a bogus SSN. This shows up in monitor mode traces:

  iwn: block ack request SSN=2551
  AP:  block ack SSN=2551
  iwn firmware: block ack request SSN=0
  AP:  block ack SSN=0

Now the receiver is out of sync, and will discard frames until iwn's sending window wraps back to zero. The firmware will happily keep transmitting frames with sequence numbers 2552, 2553, and so on; traffic is restored only when the sequence number finally wraps around after 0xfff == 4095.

In the cases I observed, the driver-generated BA request was sent at 6 Mbit/s, which is expected. However, the second frame was sent at 24 Mbit/s, which indicates that the firmware could be retrying the BA request (frames sent at a different Tx rate than specified by the driver are generally retries).

BA req frames are control frames, and our driver sends all such non-data frames via the firmware's broadcast node. This node does not represent the AP.

Sending BA req frames via the firmware node which represents the AP seems to fix the problem. I have not yet managed to trigger it again with this patch.
My best explanation is that this allows the firmware to retry block ack requests properly, and to stop retrying once a BA is received from the AP.

ok?

diff 1ff4cf56fdff3473d72fc4b29d69428c688d47c6 /usr/src (staged changes)
blob - afeb963ef626d2e98018b1d405c35936d96ba4e1
blob + 50005e1511b06c99e72952110e4f06fb30cb818b
--- sys/dev/pci/if_iwn.c
+++ sys/dev/pci/if_iwn.c
@@ -3505,7 +3505,10 @@ iwn_tx(struct iwn_softc *sc, struct mbuf *m, struct ie
 		}
 	}
 
-	if (IEEE80211_IS_MULTICAST(wh->i_addr1) ||
+	if (type == IEEE80211_FC0_TYPE_CTL &&
+	    subtype == IEEE80211_FC0_SUBTYPE_BAR)
+		tx->id = wn->id;
+	else if (IEEE80211_IS_MULTICAST(wh->i_addr1) ||
 	    type != IEEE80211_FC0_TYPE_DATA)
 		tx->id = sc->broadcast_id;
 	else