Re: ral(4) leaks mbufs instead of setting oactive
On Thu, 11 Feb 2016, Richard Procter wrote: > [...] I found a benign off-by-one error in [my] code that left a > tx descriptor unused. [...] > > I've not tested this tweak as I'm unable to access the hardware for a few > weeks, unfortunately. Now tested, via concurrent $ cat /dev/zero | nc sink sink-port # tcpdump -i ral0 -n not port 22 on the ral(4) machine over a ssh session. This fills the tx ring on my alix 2d2. Updated patch below fixes a comment and removes the second KASSERT. best, Richard. ral0 at pci0 dev 12 function 0 "Ralink RT2860" rev 0x00: irq 9, address xx:xx:xx:xx:xx:xx ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R) * fix watchdog timeouts and dropped frames under load. - on full tx ring, ring->cur wraps to an active tx descriptor. Passing that wrapped value to the card was observed to cause general flakiness. Fix prevents the wrap at the cost of reducing usable tx descriptors by one. Index: sys/dev/ic/rt2860.c === --- sys.orig/dev/ic/rt2860.c +++ sys/dev/ic/rt2860.c @@ -1171,7 +1171,7 @@ rt2860_tx_intr(struct rt2860_softc *sc, } sc->sc_tx_timer = 0; - if (ring->queued < RT2860_TX_RING_COUNT) + if (ring->queued < RT2860_TX_RING_MAX) sc->qfullmsk &= ~(1 << qid); ifq_clr_oactive(>if_snd); rt2860_start(ifp); @@ -1618,7 +1618,7 @@ rt2860_tx(struct rt2860_softc *sc, struc /* determine how many TXDs are required */ ntxds = 1 + (data->map->dm_nsegs / 2); - if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) { + if (ring->queued + ntxds >= RT2860_TX_RING_MAX) { /* not enough free TXDs, force mbuf defrag */ bus_dmamap_unload(sc->sc_dmat, data->map); error = EFBIG; @@ -1656,7 +1656,7 @@ rt2860_tx(struct rt2860_softc *sc, struc /* determine how many TXDs are now required */ ntxds = 1 + (data->map->dm_nsegs / 2); - if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) { + if (ring->queued + ntxds >= RT2860_TX_RING_MAX) { /* this is a hopeless case, drop the mbuf! */ bus_dmamap_unload(sc->sc_dmat, data->map); m_freem(m); @@ -1714,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT; ring->queued += ntxds; - if (ring->queued >= RT2860_TX_RING_COUNT) + if (ring->queued >= RT2860_TX_RING_MAX) sc->qfullmsk |= 1 << qid; /* kick Tx */ Index: sys/dev/ic/rt2860var.h === --- sys.orig/dev/ic/rt2860var.h +++ sys/dev/ic/rt2860var.h @@ -17,8 +17,9 @@ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */ -#define RT2860_TX_RING_COUNT 64 #define RT2860_RX_RING_COUNT 128 +#define RT2860_TX_RING_COUNT 64 +#define RT2860_TX_RING_MAX (RT2860_TX_RING_COUNT - 1) #define RT2860_TX_POOL_COUNT (RT2860_TX_RING_COUNT * 2) #define RT2860_MAX_SCATTER ((RT2860_TX_RING_COUNT * 2) - 1) * replace custom defrag with m_defrag() - This fixes an error in the existing code: the "hopeless case" guard equivales 'ring now full', so oactive is never set: the code drops any mbuf that would fill the ring. This occurs often in practice. - The preceding patch allows the ring to fill safely. - The new code avoids some hoop-jumping. Currently, one tx dma-map can fill the tx ring. Therefore an mbuf that fits a dma-map may yet not fit into the tx ring's remaining space. To be sure it can, we must in general count the mbuf's fragments and, if necessary, defrag it and reload the dmamap. The new code limits the dmamap to cover at most 8 tx descriptors (= 15 fragments): now, if an mbuf fits a dma-map it will fit any ring with at least 8 free descriptors. So we need only check for 8 free descriptors and are longer obliged to count fragments and jump hoops. The cost is unused tx ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one descriptor in the last block of 8. - For simplicity on error return, shift responsibilty for calling m_freem() to rt2860_tx()'s caller (which already calls ieee80211_release_node()). Index: sys/dev/ic/rt2860.c === --- sys.orig/dev/ic/rt2860.c +++ sys/dev/ic/rt2860.c @@ -567,7 +567,7 @@ rt2860_alloc_tx_pool(struct rt2860_softc error = bus_dmamap_create(sc->sc_dmat, MCLBYTES, RT2860_MAX_SCATTER, MCLBYTES, 0, BUS_DMA_NOWAIT, - >map); + >map); /* <0> */ if (error != 0) { printf("%s: could not create DMA map\n",
Re: ral(4) leaks mbufs instead of setting oactive
On Wed, Feb 03, 2016 at 04:41:34PM +1300, Richard Procter wrote: > > On Tue, 2 Feb 2016, Stefan Sperling wrote: > > > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote: > > > - ring->queued--; > > > + atomic_dec_int(>queued); > > > > > - ring->queued += ntxds; > > > + atomic_add_int(>queued, ntxds); > > > > I don't think these make a difference in the current way of things. > > Wireless drivers run interrupts under the kernel big lock, interrupts > > aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic. > [...] > > Hmm. Taking a closer look, if_start() is already called under splnet. > > So adding splnet to rt2860_tx() shouldn't make a difference. > > You're right, the atomic is unnecessary. I'd assumed rt2860_tx() was > running under splsoftnet. I could have sworn I'd seen errors without the > atomic but that now tests fine, too. (I still see errors without the > ring->cur fix.) > > > This also means the card cannot interrupt in the way your comment > > describes, i.e. the problem you're "fixing" here cannot exist... ? > > Also right. > > This simplifies things --- see below for the patch minus the above. > Without it my card under stress sees 1 oerror per ~217 packets; it's > now sent 5E6 without seeing any. > > cheers, > Richard. I think these changes are good. I'm testing them with 2 different rt2860 style ral(4) cards. I don't have the nice model you've got, though :-( Mine are 2GHz only which makes it a bit difficult to push packets out fast. There's a lot of noise from other networks. One more thing I'd like to ask: Could you transform the following paragraph you wrote into a comment somewhere in the code? > The new code limits the dmamap to cover at most 8 tx descriptors (= 15 > fragments): now, if an mbuf fits a dma-map it will fit any ring with at least > 8 free descriptors. So we need only check for 8 free descriptors and are > longer obliged to count fragments and jump hoops. The cost is unused tx > ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one > descriptor in the last block of 8.
Re: ral(4) leaks mbufs instead of setting oactive
On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote: > - ring->queued--; > + atomic_dec_int(>queued); > - ring->queued += ntxds; > + atomic_add_int(>queued, ntxds); I don't think these make a difference in the current way of things. Wireless drivers run interrupts under the kernel big lock, interrupts aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic. > * fix dropped frames, watchdog timeouts. > > - on full tx ring, ring->cur wraps to an active tx descriptor. Passing > that wrapped value to the card was observed to cause general flakiness. Nice find. > * replace custom defrag with m_defrag() Also good. We've already done this in some other drivers. > + /* XXX card may interrupt here and invalidate this guard; the You can easily prevent the card from interrupting by making rt2860_tx() call splnet() before modifying data shared with the interrupt handler. I think that's the real bug you're looking for. Could you try that and send an updated diff if it works? Thank you very much for your work on this. I wrote some fixes for the rt2560 line of ral chips some years ago fixing similar driver bugs.
Re: ral(4) leaks mbufs instead of setting oactive
On Tue, Feb 02, 2016 at 09:14:06AM +0100, Stefan Sperling wrote: > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote: > > + atomic_add_int(>queued, ntxds); > > + /* XXX card may interrupt here and invalidate this guard; the > > You can easily prevent the card from interrupting by making rt2860_tx() > call splnet() before modifying data shared with the interrupt handler. > I think that's the real bug you're looking for. > Could you try that and send an updated diff if it works? Hmm. Taking a closer look, if_start() is already called under splnet. So adding splnet to rt2860_tx() shouldn't make a difference. This also means the card cannot interrupt in the way your comment describes, i.e. the problem you're "fixing" here cannot exist... ?
Re: ral(4) leaks mbufs instead of setting oactive
On 02/02/16(Tue) 09:14, Stefan Sperling wrote: > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote: > > - ring->queued--; > > + atomic_dec_int(>queued); > > > - ring->queued += ntxds; > > + atomic_add_int(>queued, ntxds); > > I don't think these make a difference in the current way of things. It does not, the wireless stack needs some love before wifi drivers can have their interrupt handlers marked as mp-safe. > Wireless drivers run interrupts under the kernel big lock, interrupts > aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic. Adding/decrementing are not because you need to reading the existing value then write the modified one :)
Re: ral(4) leaks mbufs instead of setting oactive
On Tue, 2 Feb 2016, Stefan Sperling wrote: > > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote: > > - ring->queued--; > > + atomic_dec_int(>queued); > > > - ring->queued += ntxds; > > + atomic_add_int(>queued, ntxds); > > I don't think these make a difference in the current way of things. > Wireless drivers run interrupts under the kernel big lock, interrupts > aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic. [...] > Hmm. Taking a closer look, if_start() is already called under splnet. > So adding splnet to rt2860_tx() shouldn't make a difference. You're right, the atomic is unnecessary. I'd assumed rt2860_tx() was running under splsoftnet. I could have sworn I'd seen errors without the atomic but that now tests fine, too. (I still see errors without the ring->cur fix.) > This also means the card cannot interrupt in the way your comment > describes, i.e. the problem you're "fixing" here cannot exist... ? Also right. This simplifies things --- see below for the patch minus the above. Without it my card under stress sees 1 oerror per ~217 packets; it's now sent 5E6 without seeing any. cheers, Richard. --- * fix watchdog timeouts and dropped frames under load. - on full tx ring, ring->cur wraps to an active tx descriptor. Passing that wrapped value to the card was observed to cause general flakiness. Fix prevents the wrap at the cost of reducing usable tx descriptors by one. Index: sys/dev/ic/rt2860.c === --- sys.orig/dev/ic/rt2860.c +++ sys/dev/ic/rt2860.c @@ -1171,7 +1171,7 @@ rt2860_tx_intr(struct rt2860_softc *sc, } sc->sc_tx_timer = 0; - if (ring->queued < RT2860_TX_RING_COUNT) + if (ring->queued < RT2860_TX_RING_MAX) sc->qfullmsk &= ~(1 << qid); ifq_clr_oactive(>if_snd); rt2860_start(ifp); @@ -1618,7 +1618,7 @@ rt2860_tx(struct rt2860_softc *sc, struc /* determine how many TXDs are required */ ntxds = 1 + (data->map->dm_nsegs / 2); - if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) { + if (ring->queued + ntxds >= RT2860_TX_RING_MAX) { /* not enough free TXDs, force mbuf defrag */ bus_dmamap_unload(sc->sc_dmat, data->map); error = EFBIG; @@ -1656,7 +1656,7 @@ rt2860_tx(struct rt2860_softc *sc, struc /* determine how many TXDs are now required */ ntxds = 1 + (data->map->dm_nsegs / 2); - if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) { + if (ring->queued + ntxds >= RT2860_TX_RING_MAX) { /* this is a hopeless case, drop the mbuf! */ bus_dmamap_unload(sc->sc_dmat, data->map); m_freem(m); @@ -1714,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT; ring->queued += ntxds; - if (ring->queued >= RT2860_TX_RING_COUNT) + if (ring->queued >= RT2860_TX_RING_MAX) sc->qfullmsk |= 1 << qid; /* kick Tx */ Index: sys/dev/ic/rt2860var.h === --- sys.orig/dev/ic/rt2860var.h +++ sys/dev/ic/rt2860var.h @@ -17,8 +17,9 @@ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */ -#define RT2860_TX_RING_COUNT 64 #define RT2860_RX_RING_COUNT 128 +#define RT2860_TX_RING_COUNT 64 +#define RT2860_TX_RING_MAX (RT2860_TX_RING_COUNT - 1) #define RT2860_TX_POOL_COUNT (RT2860_TX_RING_COUNT * 2) #define RT2860_MAX_SCATTER ((RT2860_TX_RING_COUNT * 2) - 1) --- * replace custom defrag with m_defrag() - This fixes an error in the existing code: the "hopeless case" guard equivales 'ring now full', so oactive is never set: the code drops any mbuf that would fill the ring. This occurs often in practice. - The preceding patch allows the ring to fill safely. - The new code avoids some hoop-jumping. Currently, a tx dma-map can map an entire tx ring. Therefore an mbuf that fits a dma-map may yet not fit into the tx ring's remaining space. To be sure it can, we must in general count the mbuf's fragments and, if necessary, defrag it and reload the dmamap. The new code limits the dmamap to cover at most 8 tx descriptors (= 15 fragments): now, if an mbuf fits a dma-map it will fit any ring with at least 8 free descriptors. So we need only check for 8 free descriptors and are longer obliged to count fragments and jump hoops. The cost is unused tx ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one descriptor in the last block of 8. - For simplicity on error return, shift responsibilty for calling m_freem() to rt2860_tx()'s caller (which already calls ieee80211_release_node()).
ral(4) leaks mbufs instead of setting oactive
>Synopsis: ral(4) leaks mbufs instead of setting oactive >Category: kernel >Environment: System : OpenBSD 5.9 Details : OpenBSD 5.9-beta (GENERIC) #1538: Sun Jan 31 20:01:47 MST 2016 dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC >Description: An error in ral(4)'s mbuf defrag logic drops those mbufs that would otherwise fully fill the tx-ring: the oactive flag is never raised and many unnecessary oerrors occur. See patch comments below for details. Demonstrated with an RT2860, but dev/ic/rt2860.c also drives the newer chipsets and this problem should affect them, too. I haven't checked the drivers for the older chipsets. >How-To-Repeat: netcat /dev/zero over a ral(4) interface, or otherwise saturate the link. Observe 'netstat -i' oerror counters and/or SACK episodes. >Fix: Patch below tested for two days on two routers that provide an AP, one an i386 and the other socppc. Patch also stress-tested with netcat, etc: this saw oerrors drop from 1 per 224 frames to 0 over the test run, SACK episodes much reduced, etc. No bandwidth regressions observed. Note that the fix exposed three pre-existing problems with ring-full handling; two of these are fixed by the first two patches below. The third problem is probably only theoretical, was unobserved in testing and looks tricky to remove, see XXX comment. Lastly, the patch adds some KASSERTs so it may pay to test it over the next release cycle. I could replace these with workaround code if desired. best, Richard. PS. A style question: is there a preference between 'if (foo() != 0)' and 'if (foo())' when checking error returns? ral0 at pci0 dev 12 function 0 "Ralink RT2860" rev 0x00: irq 9, address 00:0e:8e:xx:xx:xx ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R) ral0 at pci0 dev 15 function 0 "Ralink RT2860" rev 0x00: ivec 19, address 00:0e:8e:xx:xx:xx ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R) -- * fix dropped frames, watchdog timeouts - ring->queued is shared with the interrupt handler; update it atomically. Once exposed, this problem manifested in testing. Index: sys/dev/ic/rt2860.c === --- sys.orig/dev/ic/rt2860.c +++ sys/dev/ic/rt2860.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -1166,7 +1167,7 @@ rt2860_tx_intr(struct rt2860_softc *sc, ifp->if_opackets++; } - ring->queued--; + atomic_dec_int(>queued); ring->next = (ring->next + 1) % RT2860_TX_RING_COUNT; } @@ -1713,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc qid, txwi->wcid, data->map->dm_nsegs, ridx)); ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT; - ring->queued += ntxds; + atomic_add_int(>queued, ntxds); if (ring->queued >= RT2860_TX_RING_COUNT) sc->qfullmsk |= 1 << qid; Index: sys/dev/ic/rt2860var.h === --- sys.orig/dev/ic/rt2860var.h +++ sys/dev/ic/rt2860var.h @@ -78,7 +78,7 @@ struct rt2860_tx_ring { struct rt2860_tx_data *data[RT2860_TX_RING_COUNT]; int cur; int next; - int queued; + unsigned intqueued; }; struct rt2860_rx_data { -- * fix dropped frames, watchdog timeouts. - on full tx ring, ring->cur wraps to an active tx descriptor. Passing that wrapped value to the card was observed to cause general flakiness. Fix prevents the wrap at the cost of reducing usable tx descriptors by one. Note: this issue is independent of the ring->queued race; once exposed, neither this nor that patch alone suffices to prevent dropped frames. Index: sys/dev/ic/rt2860.c === --- sys.orig/dev/ic/rt2860.c +++ sys/dev/ic/rt2860.c @@ -1172,7 +1172,7 @@ rt2860_tx_intr(struct rt2860_softc *sc, } sc->sc_tx_timer = 0; - if (ring->queued < RT2860_TX_RING_COUNT) + if (ring->queued < RT2860_TX_RING_MAX) sc->qfullmsk &= ~(1 << qid); ifq_clr_oactive(>if_snd); rt2860_start(ifp); @@ -1619,7 +1619,7 @@ rt2860_tx(struct rt2860_softc *sc, struc /* determine how many TXDs are required */ ntxds = 1 + (data->map->dm_nsegs / 2); - if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) { + if (ring->queued + ntxds >= RT2860_TX_RING_MAX) { /* not enough free TXDs, force mbuf defrag */ bus_dmamap_unload(sc->sc_dmat, d