Re: ral(4) leaks mbufs instead of setting oactive

2016-02-29 Thread Richard Procter

On Thu, 11 Feb 2016, Richard Procter wrote:
> [...] I found a benign off-by-one error in [my] code that left a 
> tx descriptor unused. [...]
>
> I've not tested this tweak as I'm unable to access the hardware for a few 
> weeks, unfortunately.
 
Now tested, via concurrent 

$ cat /dev/zero | nc sink sink-port
# tcpdump -i ral0 -n not port 22   

on the ral(4) machine over a ssh session. This fills the tx ring on my 
alix 2d2.

Updated patch below fixes a comment and removes the second KASSERT. 

best,
Richard.

ral0 at pci0 dev 12 function 0 "Ralink RT2860" rev 0x00: irq 9, address 
xx:xx:xx:xx:xx:xx
ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R)


* fix watchdog timeouts and dropped frames under load. 

- on full tx ring, ring->cur wraps to an active tx descriptor. Passing 
that wrapped value to the card was observed to cause general flakiness.

Fix prevents the wrap at the cost of reducing usable tx descriptors by 
one.

Index: sys/dev/ic/rt2860.c
===
--- sys.orig/dev/ic/rt2860.c
+++ sys/dev/ic/rt2860.c
@@ -1171,7 +1171,7 @@ rt2860_tx_intr(struct rt2860_softc *sc,
}
 
sc->sc_tx_timer = 0;
-   if (ring->queued < RT2860_TX_RING_COUNT)
+   if (ring->queued < RT2860_TX_RING_MAX)
sc->qfullmsk &= ~(1 << qid);
ifq_clr_oactive(>if_snd);
rt2860_start(ifp);
@@ -1618,7 +1618,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
/* determine how many TXDs are required */
ntxds = 1 + (data->map->dm_nsegs / 2);
 
-   if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) {
+   if (ring->queued + ntxds >= RT2860_TX_RING_MAX) {
/* not enough free TXDs, force mbuf defrag */
bus_dmamap_unload(sc->sc_dmat, data->map);
error = EFBIG;
@@ -1656,7 +1656,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
/* determine how many TXDs are now required */
ntxds = 1 + (data->map->dm_nsegs / 2);
 
-   if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) {
+   if (ring->queued + ntxds >= RT2860_TX_RING_MAX) {
/* this is a hopeless case, drop the mbuf! */
bus_dmamap_unload(sc->sc_dmat, data->map);
m_freem(m);
@@ -1714,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
 
ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT;
ring->queued += ntxds;
-   if (ring->queued >= RT2860_TX_RING_COUNT)
+   if (ring->queued >= RT2860_TX_RING_MAX)
sc->qfullmsk |= 1 << qid;
 
/* kick Tx */
Index: sys/dev/ic/rt2860var.h
===
--- sys.orig/dev/ic/rt2860var.h
+++ sys/dev/ic/rt2860var.h
@@ -17,8 +17,9 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */
 
-#define RT2860_TX_RING_COUNT   64
 #define RT2860_RX_RING_COUNT   128
+#define RT2860_TX_RING_COUNT   64
+#define RT2860_TX_RING_MAX (RT2860_TX_RING_COUNT - 1)
 #define RT2860_TX_POOL_COUNT   (RT2860_TX_RING_COUNT * 2)
 
 #define RT2860_MAX_SCATTER ((RT2860_TX_RING_COUNT * 2) - 1)

* replace custom defrag with m_defrag() 

- This fixes an error in the existing code: the "hopeless case" guard
equivales 'ring now full', so oactive is never set: the code drops any mbuf
that would fill the ring. This occurs often in practice.

- The preceding patch allows the ring to fill safely. 

- The new code avoids some hoop-jumping. Currently, one tx dma-map can fill
the tx ring. Therefore an mbuf that fits a dma-map may yet not fit into the tx
ring's remaining space.  To be sure it can, we must in general count the
mbuf's fragments and, if necessary, defrag it and reload the dmamap.

The new code limits the dmamap to cover at most 8 tx descriptors (= 15
fragments): now, if an mbuf fits a dma-map it will fit any ring with at least
8 free descriptors. So we need only check for 8 free descriptors and are
longer obliged to count fragments and jump hoops. The cost is unused tx
ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one
descriptor in the last block of 8.

- For simplicity on error return, shift responsibilty for calling
m_freem() to rt2860_tx()'s caller (which already calls
ieee80211_release_node()).

Index: sys/dev/ic/rt2860.c
===
--- sys.orig/dev/ic/rt2860.c
+++ sys/dev/ic/rt2860.c
@@ -567,7 +567,7 @@ rt2860_alloc_tx_pool(struct rt2860_softc
 
error = bus_dmamap_create(sc->sc_dmat, MCLBYTES,
RT2860_MAX_SCATTER, MCLBYTES, 0, BUS_DMA_NOWAIT,
-   >map);
+   >map); /* <0> */
if (error != 0) {
printf("%s: could not create DMA map\n",
 

Re: ral(4) leaks mbufs instead of setting oactive

2016-02-06 Thread Stefan Sperling
On Wed, Feb 03, 2016 at 04:41:34PM +1300, Richard Procter wrote:
> 
> On Tue, 2 Feb 2016, Stefan Sperling wrote:  
> > > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote:
> > > -   ring->queued--;
> > > +   atomic_dec_int(>queued);
> > 
> > > -   ring->queued += ntxds;
> > > +   atomic_add_int(>queued, ntxds);
> > 
> > I don't think these make a difference in the current way of things.
> > Wireless drivers run interrupts under the kernel big lock, interrupts
> > aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic.  
> [...]
> > Hmm. Taking a closer look, if_start() is already called under splnet.
> > So adding splnet to rt2860_tx() shouldn't make a difference.
> 
> You're right, the atomic is unnecessary. I'd assumed rt2860_tx() was 
> running under splsoftnet. I could have sworn I'd seen errors without the 
> atomic but that now tests fine, too. (I still see errors without the 
> ring->cur fix.)
> 
> > This also means the card cannot interrupt in the way your comment
> > describes, i.e. the problem you're "fixing" here cannot exist... ?
> 
> Also right.
> 
> This simplifies things --- see below for the patch minus the above.  
> Without it my card under stress sees 1 oerror per ~217 packets; it's 
> now sent 5E6 without seeing any.
> 
> cheers, 
> Richard. 

I think these changes are good. I'm testing them with 2 different
rt2860 style ral(4) cards. I don't have the nice model you've got,
though :-(  Mine are 2GHz only which makes it a bit difficult to
push packets out fast. There's a lot of noise from other networks.

One more thing I'd like to ask: Could you transform the following
paragraph you wrote into a comment somewhere in the code?

> The new code limits the dmamap to cover at most 8 tx descriptors (= 15
> fragments): now, if an mbuf fits a dma-map it will fit any ring with at least
> 8 free descriptors. So we need only check for 8 free descriptors and are
> longer obliged to count fragments and jump hoops. The cost is unused tx
> ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one
> descriptor in the last block of 8.



Re: ral(4) leaks mbufs instead of setting oactive

2016-02-02 Thread Stefan Sperling
On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote:
> - ring->queued--;
> + atomic_dec_int(>queued);

> - ring->queued += ntxds;
> + atomic_add_int(>queued, ntxds);

I don't think these make a difference in the current way of things.
Wireless drivers run interrupts under the kernel big lock, interrupts
aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic.

> * fix dropped frames, watchdog timeouts. 
> 
> - on full tx ring, ring->cur wraps to an active tx descriptor. Passing 
> that wrapped value to the card was observed to cause general flakiness.

Nice find.

> * replace custom defrag with m_defrag() 

Also good. We've already done this in some other drivers.

> + /* XXX card may interrupt here and invalidate this guard; the

You can easily prevent the card from interrupting by making rt2860_tx()
call splnet() before modifying data shared with the interrupt handler.
I think that's the real bug you're looking for.
Could you try that and send an updated diff if it works?

Thank you very much for your work on this. I wrote some fixes for the
rt2560 line of ral chips some years ago fixing similar driver bugs.



Re: ral(4) leaks mbufs instead of setting oactive

2016-02-02 Thread Stefan Sperling
On Tue, Feb 02, 2016 at 09:14:06AM +0100, Stefan Sperling wrote:
> On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote:
> > +   atomic_add_int(>queued, ntxds);
> > +   /* XXX card may interrupt here and invalidate this guard; the
> 
> You can easily prevent the card from interrupting by making rt2860_tx()
> call splnet() before modifying data shared with the interrupt handler.
> I think that's the real bug you're looking for.
> Could you try that and send an updated diff if it works?

Hmm. Taking a closer look, if_start() is already called under splnet.
So adding splnet to rt2860_tx() shouldn't make a difference.

This also means the card cannot interrupt in the way your comment
describes, i.e. the problem you're "fixing" here cannot exist... ?



Re: ral(4) leaks mbufs instead of setting oactive

2016-02-02 Thread Martin Pieuchot
On 02/02/16(Tue) 09:14, Stefan Sperling wrote:
> On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote:
> > -   ring->queued--;
> > +   atomic_dec_int(>queued);
> 
> > -   ring->queued += ntxds;
> > +   atomic_add_int(>queued, ntxds);
> 
> I don't think these make a difference in the current way of things.

It does not, the wireless stack needs some love before wifi drivers can
have their interrupt handlers marked as mp-safe.

> Wireless drivers run interrupts under the kernel big lock, interrupts
> aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic.

Adding/decrementing are not because you need to reading the existing
value then write the modified one :)



Re: ral(4) leaks mbufs instead of setting oactive

2016-02-02 Thread Richard Procter

On Tue, 2 Feb 2016, Stefan Sperling wrote:  
> > On Sat, Jan 30, 2016 at 10:49:38PM +1300, Richard Procter wrote:
> > -   ring->queued--;
> > +   atomic_dec_int(>queued);
> 
> > -   ring->queued += ntxds;
> > +   atomic_add_int(>queued, ntxds);
> 
> I don't think these make a difference in the current way of things.
> Wireless drivers run interrupts under the kernel big lock, interrupts
> aren't preemptible, and AFAIK (most?) 32bit integer operations are atomic.  
[...]
> Hmm. Taking a closer look, if_start() is already called under splnet.
> So adding splnet to rt2860_tx() shouldn't make a difference.

You're right, the atomic is unnecessary. I'd assumed rt2860_tx() was 
running under splsoftnet. I could have sworn I'd seen errors without the 
atomic but that now tests fine, too. (I still see errors without the 
ring->cur fix.)

> This also means the card cannot interrupt in the way your comment
> describes, i.e. the problem you're "fixing" here cannot exist... ?

Also right.

This simplifies things --- see below for the patch minus the above.  
Without it my card under stress sees 1 oerror per ~217 packets; it's 
now sent 5E6 without seeing any.

cheers, 
Richard. 

---
* fix watchdog timeouts and dropped frames under load. 

- on full tx ring, ring->cur wraps to an active tx descriptor. Passing 
that wrapped value to the card was observed to cause general flakiness.

Fix prevents the wrap at the cost of reducing usable tx descriptors by 
one.

Index: sys/dev/ic/rt2860.c
===
--- sys.orig/dev/ic/rt2860.c
+++ sys/dev/ic/rt2860.c
@@ -1171,7 +1171,7 @@ rt2860_tx_intr(struct rt2860_softc *sc,
}
 
sc->sc_tx_timer = 0;
-   if (ring->queued < RT2860_TX_RING_COUNT)
+   if (ring->queued < RT2860_TX_RING_MAX)
sc->qfullmsk &= ~(1 << qid);
ifq_clr_oactive(>if_snd);
rt2860_start(ifp);
@@ -1618,7 +1618,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
/* determine how many TXDs are required */
ntxds = 1 + (data->map->dm_nsegs / 2);
 
-   if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) {
+   if (ring->queued + ntxds >= RT2860_TX_RING_MAX) {
/* not enough free TXDs, force mbuf defrag */
bus_dmamap_unload(sc->sc_dmat, data->map);
error = EFBIG;
@@ -1656,7 +1656,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
/* determine how many TXDs are now required */
ntxds = 1 + (data->map->dm_nsegs / 2);
 
-   if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) {
+   if (ring->queued + ntxds >= RT2860_TX_RING_MAX) {
/* this is a hopeless case, drop the mbuf! */
bus_dmamap_unload(sc->sc_dmat, data->map);
m_freem(m);
@@ -1714,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
 
ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT;
ring->queued += ntxds;
-   if (ring->queued >= RT2860_TX_RING_COUNT)
+   if (ring->queued >= RT2860_TX_RING_MAX)
sc->qfullmsk |= 1 << qid;
 
/* kick Tx */
Index: sys/dev/ic/rt2860var.h
===
--- sys.orig/dev/ic/rt2860var.h
+++ sys/dev/ic/rt2860var.h
@@ -17,8 +17,9 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */
 
-#define RT2860_TX_RING_COUNT   64
 #define RT2860_RX_RING_COUNT   128
+#define RT2860_TX_RING_COUNT   64
+#define RT2860_TX_RING_MAX (RT2860_TX_RING_COUNT - 1)
 #define RT2860_TX_POOL_COUNT   (RT2860_TX_RING_COUNT * 2)
 
 #define RT2860_MAX_SCATTER ((RT2860_TX_RING_COUNT * 2) - 1)

---
* replace custom defrag with m_defrag() 

- This fixes an error in the existing code: the "hopeless case" guard
equivales 'ring now full', so oactive is never set: the code drops any mbuf
that would fill the ring. This occurs often in practice.

- The preceding patch allows the ring to fill safely. 

- The new code avoids some hoop-jumping. Currently, a tx dma-map can map 
an entire tx ring. Therefore an mbuf that fits a dma-map may yet not fit 
into the tx ring's remaining space.  To be sure it can, we must in general 
count the mbuf's fragments and, if necessary, defrag it and reload the 
dmamap.

The new code limits the dmamap to cover at most 8 tx descriptors (= 15
fragments): now, if an mbuf fits a dma-map it will fit any ring with at least
8 free descriptors. So we need only check for 8 free descriptors and are
longer obliged to count fragments and jump hoops. The cost is unused tx
ring descriptors, at most 7 of 63, when a one-fragment mbuf occupies one
descriptor in the last block of 8.

- For simplicity on error return, shift responsibilty for calling
m_freem() to rt2860_tx()'s caller (which already calls
ieee80211_release_node()).

ral(4) leaks mbufs instead of setting oactive

2016-02-01 Thread Richard Procter

>Synopsis:      ral(4) leaks mbufs instead of setting oactive 
>Category:  kernel
>Environment:
System  : OpenBSD 5.9
Details : OpenBSD 5.9-beta (GENERIC) #1538: Sun Jan 31 20:01:47 MST 
2016
 
dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC

>Description:
An error in ral(4)'s mbuf defrag logic drops those mbufs that would 
otherwise fully fill the tx-ring: the oactive flag is never raised and 
many unnecessary oerrors occur. See patch comments below for details.

Demonstrated with an RT2860, but dev/ic/rt2860.c also drives the newer 
chipsets and this problem should affect them, too. I haven't checked the 
drivers for the older chipsets.

>How-To-Repeat: 
netcat /dev/zero over a ral(4) interface, or otherwise saturate the link. 
Observe 'netstat -i' oerror counters and/or SACK episodes.

>Fix:
Patch below tested for two days on two routers that provide an AP, one an 
i386 and the other socppc. Patch also stress-tested with netcat, etc: this 
saw oerrors drop from 1 per 224 frames to 0 over the test run, SACK 
episodes much reduced, etc. No bandwidth regressions observed.

Note that the fix exposed three pre-existing problems with ring-full 
handling; two of these are fixed by the first two patches below. The third 
problem is probably only theoretical, was unobserved in testing and looks 
tricky to remove, see XXX comment.

Lastly, the patch adds some KASSERTs so it may pay to test it over the 
next release cycle. I could replace these with workaround code if desired.

best, 
Richard. 

PS. A style question: is there a preference between 'if (foo() != 0)' and  
'if (foo())' when checking error returns?

ral0 at pci0 dev 12 function 0 "Ralink RT2860" rev 0x00: irq 9, address 
00:0e:8e:xx:xx:xx 
ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R) 

ral0 at pci0 dev 15 function 0 "Ralink RT2860" rev 0x00: ivec 19, address 
00:0e:8e:xx:xx:xx
ral0: MAC/BBP RT2860 (rev 0x0103), RF RT2850 (MIMO 2T3R)

--
 * fix dropped frames, watchdog timeouts

- ring->queued is shared with the interrupt handler; update it atomically. 

Once exposed, this problem manifested in testing. 

Index: sys/dev/ic/rt2860.c
===
--- sys.orig/dev/ic/rt2860.c
+++ sys/dev/ic/rt2860.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1166,7 +1167,7 @@ rt2860_tx_intr(struct rt2860_softc *sc,
 
ifp->if_opackets++;
}
-   ring->queued--;
+   atomic_dec_int(>queued);
ring->next = (ring->next + 1) % RT2860_TX_RING_COUNT;
}
 
@@ -1713,7 +1714,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
qid, txwi->wcid, data->map->dm_nsegs, ridx));
 
ring->cur = (ring->cur + 1) % RT2860_TX_RING_COUNT;
-   ring->queued += ntxds;
+   atomic_add_int(>queued, ntxds);
if (ring->queued >= RT2860_TX_RING_COUNT)
sc->qfullmsk |= 1 << qid;
 
Index: sys/dev/ic/rt2860var.h
===
--- sys.orig/dev/ic/rt2860var.h
+++ sys/dev/ic/rt2860var.h
@@ -78,7 +78,7 @@ struct rt2860_tx_ring {
struct rt2860_tx_data   *data[RT2860_TX_RING_COUNT];
int cur;
int next;
-   int queued;
+   unsigned intqueued;
 };
 
 struct rt2860_rx_data {
--

* fix dropped frames, watchdog timeouts. 

- on full tx ring, ring->cur wraps to an active tx descriptor. Passing 
that wrapped value to the card was observed to cause general flakiness.

Fix prevents the wrap at the cost of reducing usable tx descriptors by 
one.

Note: this issue is independent of the ring->queued race; once exposed, 
neither this nor that patch alone suffices to prevent dropped frames.

Index: sys/dev/ic/rt2860.c
===
--- sys.orig/dev/ic/rt2860.c
+++ sys/dev/ic/rt2860.c
@@ -1172,7 +1172,7 @@ rt2860_tx_intr(struct rt2860_softc *sc,
}
 
sc->sc_tx_timer = 0;
-   if (ring->queued < RT2860_TX_RING_COUNT)
+   if (ring->queued < RT2860_TX_RING_MAX)
sc->qfullmsk &= ~(1 << qid);
ifq_clr_oactive(>if_snd);
rt2860_start(ifp);
@@ -1619,7 +1619,7 @@ rt2860_tx(struct rt2860_softc *sc, struc
/* determine how many TXDs are required */
ntxds = 1 + (data->map->dm_nsegs / 2);
 
-   if (ring->queued + ntxds >= RT2860_TX_RING_COUNT) {
+   if (ring->queued + ntxds >= RT2860_TX_RING_MAX) {
/* not enough free TXDs, force mbuf defrag */
bus_dmamap_unload(sc->sc_dmat, d