Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
> On Nov 23 Roland Dreir sent a patch for interrupt handling, but it
> doesn't apply on -current since the file rt2661.c changed slightly
> a few weeks earlier (1.51, date: 2009/11/01).

two "e"s in "Dreier" :)

> This patch just changes Roland's patch to update against rt2661.c
> r1.51 from the OpenBSD repository instead of Roland's patch which
> is against his private GIT repo.

Sorry about that... I was testing on a 4.5 box, so even though I had
the patch against -current, I sent the wrong (backported) one.

> I've been running with this for just over a day, including some
> time copying kernels and snaps both ways non-stop (after removing
> the ifconfig down/up from crontab). It has locked up only twice in
> 24 hrs, a definite improvement.

Thanks for testing and keeping this patch alive. I would like to see
this committed since I have multiple reports of this improving
stability for people, and also I think that it is pretty clearly
correct on a theoretical level too. However I have not seen any
response from damien@ unfortunately.

In my setup (slow VIA mini-itx box used as an AP) I've not seen any
lockups with the patch applied. Could you give a quick description of
your setup? Are the lockups you see the same as before -- i.e. the
interface stops with "OACTIVE" set, and recovers if you do ifconfig
down/up? (Is that the problem you were having before?)

I sent another patch
(http://www.mail-archive.com/t...@openbsd.org/msg01261.html) that
helps my setup a little more (avoids the "interface stays up but no
longer sends broadcasts or multicasts" problem I saw -- not sure why
exactly but avoiding sending the adapter garbage descriptors seems
like a good idea in any case). You could try with that too and see if
it helps at all.

I do still see another problem that I have not figured out yet, namely
the ral interface on the AP stops sending for 20 or 30 seconds and
then recovers by itself. If I run ping to the AP on a client box and
leave something like "tcpdump -i ral0 -n icmp" running on the AP, then
I see that requests continue to be received during the interruption,
but no replies are sent. Also I can see that OACTIVE is not set during
the interruption. But I don't know why this is happening yet. Is it
possible that this is what you're hitting too?

Thanks,
  Roland
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
Ian Darwin wrote:
> On Nov 23 Roland Dreir sent a patch for interrupt handling, but it
> doesn't apply on -current since the file rt2661.c changed slightly
> a few weeks earlier (1.51, date: 2009/11/01).
>
> This patch just changes Roland's patch to update against rt2661.c
> r1.51 from the OpenBSD repository instead of Roland's patch which
> is against his private GIT repo.
>
> I've been running with this for just over a day, including some
> time copying kernels and snaps both ways non-stop (after removing
> the ifconfig down/up from crontab). It has locked up only twice in
> 24 hrs, a definite improvement.

I have been running the old patch applied to 4.6-stable for a few
weeks on my Soekris with a mini-PCI ral card and SWMBO copying large
files over wifi. No new problems so far; the connection seems to be
more stable than before.

Kind regards,
Markus
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
On Nov 23 Roland Dreir sent a patch for interrupt handling, but it
doesn't apply on -current since the file rt2661.c changed slightly
a few weeks earlier (1.51, date: 2009/11/01).

This patch just changes Roland's patch to update against rt2661.c
r1.51 from the OpenBSD repository instead of Roland's patch which
is against his private GIT repo.

I've been running with this for just over a day, including some
time copying kernels and snaps both ways non-stop (after removing
the ifconfig down/up from crontab). It has locked up only twice in
24 hrs, a definite improvement.

Index: rt2661.c
===
RCS file: /cvs/src/sys/dev/ic/rt2661.c,v
retrieving revision 1.51
diff -N -u -p rt2661.c
--- rt2661.c	1 Nov 2009 12:08:36 -	1.51
+++ rt2661.c	28 Dec 2009 21:16:06 -
@@ -97,9 +97,8 @@ void rt2661_newassoc(struct ieee80211com *, struct ie
 int		rt2661_newstate(struct ieee80211com *,
 		    enum ieee80211_state, int);
 uint16_t	rt2661_eeprom_read(struct rt2661_softc *, uint8_t);
+void		rt2661_free_tx_desc(struct rt2661_softc *, struct rt2661_tx_ring *);
 void		rt2661_tx_intr(struct rt2661_softc *);
-void		rt2661_tx_dma_intr(struct rt2661_softc *,
-		    struct rt2661_tx_ring *);
 void		rt2661_rx_intr(struct rt2661_softc *);
 #ifndef IEEE80211_STA_ONLY
 void		rt2661_mcu_beacon_expire(struct rt2661_softc *);
@@ -115,7 +114,7 @@ uint16_t	rt2661_txtime(int, int, uint32_t);
 uint8_t		rt2661_plcp_signal(int);
 void		rt2661_setup_tx_desc(struct rt2661_softc *,
 		    struct rt2661_tx_desc *, uint32_t, uint16_t, int, int,
-		    const bus_dma_segment_t *, int, int);
+		    const bus_dma_segment_t *, int, int, int);
 int		rt2661_tx_mgt(struct rt2661_softc *, struct mbuf *,
 		    struct ieee80211_node *);
 int		rt2661_tx_data(struct rt2661_softc *, struct mbuf *,
@@ -376,7 +375,7 @@ rt2661_alloc_tx_ring(struct rt2661_softc *sc, struct r
 
 	ring->count = count;
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 
 	error = bus_dmamap_create(sc->sc_dmat, count * RT2661_TX_DESC_SIZE, 1,
 	    count * RT2661_TX_DESC_SIZE, 0, BUS_DMA_NOWAIT, &ring->map);
@@ -470,7 +469,7 @@ rt2661_reset_tx_ring(struct rt2661_softc *sc, struct r
 	    BUS_DMASYNC_PREWRITE);
 
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 }
 
 void
@@ -881,6 +880,36 @@ rt2661_eeprom_read(struct rt2661_softc *sc, uint8_t ad
 }
 
 void
+rt2661_free_tx_desc(struct rt2661_softc *sc, struct rt2661_tx_ring *txq)
+{
+	struct rt2661_tx_desc *desc = &txq->desc[txq->stat];
+	struct rt2661_tx_data *data = &txq->data[txq->stat];
+	struct ieee80211com *ic = &sc->sc_ic;
+
+	bus_dmamap_sync(sc->sc_dmat, data->map, 0,
+	    data->map->dm_mapsize, BUS_DMASYNC_POSTWRITE);
+	bus_dmamap_unload(sc->sc_dmat, data->map);
+	m_freem(data->m);
+	data->m = NULL;
+
+	/* descriptor is no longer valid */
+	desc->flags &= ~htole32(RT2661_TX_VALID);
+
+	bus_dmamap_sync(sc->sc_dmat, txq->map,
+	    txq->stat * RT2661_TX_DESC_SIZE, RT2661_TX_DESC_SIZE,
+	    BUS_DMASYNC_PREWRITE);
+
+	if (data->ni) {
+		ieee80211_release_node(ic, data->ni);
+		data->ni = NULL;
+	}
+
+	txq->queued--;
+	if (++txq->stat >= txq->count)	/* faster than % count */
+		txq->stat = 0;
+}
+
+void
 rt2661_tx_intr(struct rt2661_softc *sc)
 {
 	struct ieee80211com *ic = &sc->sc_ic;
@@ -888,7 +917,7 @@ rt2661_tx_intr(struct rt2661_softc *sc)
 	struct rt2661_tx_ring *txq;
 	struct rt2661_tx_data *data;
 	struct rt2661_node *rn;
-	int qid, retrycnt;
+	int qid, ind, retrycnt;
 
 	for (;;) {
 		const uint32_t val = RAL_READ(sc, RT2661_STA_CSR4);
@@ -898,7 +927,14 @@ rt2661_tx_intr(struct rt2661_softc *sc)
 		/* retrieve the queue in which this frame was sent */
 		qid = RT2661_TX_QID(val);
 		txq = (qid <= 3) ? &sc->txq[qid] : &sc->mgtq;
+		ind = RT2661_TX_INDEX(val);
+		if (txq->stat != ind)
+			DPRINTFN(10, ("missed TX interrupt, catching up "
+			    "stat %d to index %d\n", txq->stat, ind, qid));
+		while (txq->stat != ind)
+			rt2661_free_tx_desc(sc, txq);
+
 		/* retrieve rate control algorithm context */
 		data = &txq->data[txq->stat];
 		rn = (struct rt2661_node *)data->ni;
@@ -934,14 +970,9 @@ rt2661_tx_intr(struct rt2661_softc *sc)
 			ifp->if_oerrors++;
 		}
 
-
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
I've applied Roland's patch and it works for my ral(4) device:

ral0 at pci0 dev 14 function 0 "Ralink RT2661" rev 0x00: irq 10, address 00:14:85:d5:39:bb
ral0: MAC/BBP RT2661D, RF RT2529 (MIMO XR)

On my Soekris Net5501, ral0 in ap mode would hang after a while and be
useless. Now it happily passes packets back and forth.

The only issues I have are these:

1) Wireless clients seem to regularly, randomly drop out for 20-25
   seconds. However, they do reconnect. This is annoying when I am
   ssh'd into the firewall itself as it breaks my ssh connection when
   it happens. I've seen messages like these:

   ral0: station 00:90:39:bb:00:90 disassociate (reason 7)
   ral0: sending disassoc to 00:90:39:bb:00:90 on channel 7 mode 11g

   I don't know how else to debug these. The clients happen to be
   Linux machines. There seems to be no way to turn off their
   powersave mode with iwconfig. (Two of the clients have rt2860
   chips, one has an atheros chip.) I'm not sure if it is powersave in
   this case or not.

2) ARP packets are not being passed between clients. I am able to fix
   this by running 'ifconfig ral0 -nwflag nobridge', which resets ral0
   and then the clients can see each other again.. until ral0 decides
   to reset again.

3) I am seeing Ierrs and Oerrs on the ral0 interface:

   Name    Mtu   Network     Address              Ipkts Ierrs    Opkts Oerrs Colls
   ral0    1500              00:14:85:d5:39:bb  2171582 25832  3508310  2026     0

I'd love to test any further patches to fix the broadcast/multicast
issues. The patch actually makes my RT2661 card usable again, which is
great!

Tom
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
> Mind sharing your hostname.ral0 and the tools you use to trigger this
> situation? I've tried hping, tcpbench, ping -f, rsync, etc to no avail.
>
> max ~8000 intr/s with hping
> 2.5MB/s with scp

hostname.ral0 is:

inet 10.2.0.1 255.255.0.0 NONE \
	mode 11g \
	mediaopt hostap \
	nwid \
	wpa \
	wpaprotos wpa2 \
	wpapsk 0x \
	wpaakms psk \
	chan 1
inet6 alias 2001:470:8379:2::1

and this system is basically my home wireless AP -- so it's routing
between wired ethernet hooked up to my cable modem and my laptops etc.

I see the interface get stuck intermittently under pretty much any
heavy traffic from my laptop -- rsync over ssh to a system on wired
ethernet, uploading big files to the external internet, etc. I think
maybe having a lot of small ack packets to send exposes the race the
best, since typically I see the problem when I am sending a lot via
TCP from the laptop through the slow AP.

If you search the web for soekris and rt2661 then you can find several
other people that seem to be hitting this bug from many months ago,
which makes sense -- a geode is probably a slow enough CPU to make the
races bigger.

> cpu0: AMD Athlon(tm) XP 2500+ ("AuthenticAMD" 686-class, 512KB L2 cache)
> 1.84 GHz

Your CPU may be too fast... my system has:

cpu0: VIA Samuel 2 ("CentaurHauls" 686-class) 602 MHz

If your system can service TX interrupts fast enough that there is
never more than one packet being completed, the standard driver should
work fine.

 - R.
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
On Sun, Nov 22, 2009 at 08:31:07PM -0800, Roland Dreier wrote:
> The interrupt handling in ral(4) for RT2661 has a couple of problems,
> which causes the interface to get stuck under heavy load with OACTIVE
> set (the problems are likely especially severe on slow systems such as
> my 600MHz VIA system); bouncing the interface down and back up fixes
> things. As I describe below, I think I've been able to fix it, and
> I'd be happy to see the patch below reviewed and applied.
>
> I've seen other reports that look similar to the problems I was
> having; eg bug kernel/5958 starts out talking about RT2860 (which is
> completely different code) but some of the "me too" replies are for
> RT2561S, which I hope this patch fixes (I've cc'ed those reporters;
> test reports welcome!). I've not looked at the RT2860 code due to
> lack of hardware, but if someone wants to send me a PCI card

I've found an unused RT 2561 and did some tests with it.

>
> The first problem is that multiple TX completions may happen before
> the interrupt handler gets to rt2661_tx_intr(). When this happens,
> the TX interrupt handler only completes one entry in the TX ring,
> which leads to the driver getting behind the hardware. To fix this, I
> extended the qid field in the TX descriptor to contain the index in
> the TX ring as well as the queue ID, and then when an interrupt is
> missed, free the earlier TX entries as well as the entry that the
> interrupt is for. (I did see this code trigger under load)
>
> This exposes the second problem: there is a race that is inherent in
> separating TX completion handling between TX DMA interrupts and TX
> interrupts -- the driver may handle all the TX DMAs that finished when
> it called rt2661_tx_dma_intr(), but by the time it gets to
> rt2661_tx_intr(), another TX may have completed and the driver may end
> up processing a TX completion for which it hasn't handled the TX DMA
> completion. This ends up leaking mbufs if a new send is enqueued
> before the TX DMA interrupt has a chance to "catch up." (This happens
> in practice on my system as well)
>
> It is probably possible to fix this and keep the split DMA/TX
> handling, but that seems to require unneeded complexity. Instead, we
> can just ignore TX DMA interrupts and handle everything when the TX
> actually completes. This means we don't free the mbuf quite as soon,
> but since we can't reuse the slot in the TX ring anyway, I don't see
> this as a problem in practice.
>
> With this patch applied, the ral interface on my access point is able
> to continue operating under load that would cause the interface to get
> stuck with the stock driver fairly quickly.

I don't see any difference between your patch and -current
(but it does work, no issues)

Mind sharing your hostname.ral0 and the tools you use to trigger this
situation? I've tried hping, tcpbench, ping -f, rsync, etc to no avail.

max ~8000 intr/s with hping
2.5MB/s with scp

OpenBSD 4.6-current (GENERIC) #0: Sat Dec 5 16:13:19 CET 2009
    tobi...@neodym.tmux.org:/home/tobiasu/obsd/src/sys/arch/i386/compile/GENERIC
cpu0: AMD Athlon(tm) XP 2500+ ("AuthenticAMD" 686-class, 512KB L2 cache) 1.84 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE
real mem = 1610117120 (1535MB)
avail mem = 1551433728 (1479MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 05/17/05, BIOS32 rev. 0 @ 0xfa390, SMBIOS rev. 2.3 @ 0xf0100 (38 entries)
bios0: vendor Award Software International, Inc. version "F6" date 05/17/2005
bios0: Gigabyte Technology Co., Ltd. GA-7S748
apm0 at bios0: Power Management spec V1.2 (slowidle)
apm0: AC on, battery charge unknown
acpi at bios0 function 0x0 not configured
pcibios0 at bios0: rev 2.1 @ 0xf/0xc784
pcibios0: PCI IRQ Routing Table rev 1.0 @ 0xfc6f0/144 (7 entries)
pcibios0: PCI Exclusive IRQs: 5 6 9 10 11
pcibios0: PCI Interrupt Router at 000:02:0 ("SiS 85C503 System" rev 0x00)
pcibios0: PCI bus #1 is the last bus
bios0: ROM list: 0xc/0xf600 0xd/0x8000!
cpu0 at mainbus0: (uniprocessor)
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pchb0 at pci0 dev 0 function 0 "SiS 746 PCI" rev 0x10
sisagp0 at pchb0
agp0 at sisagp0: aperture at 0xe000, size 0x400
ppb0 at pci0 dev 1 function 0 "SiS 86C202 VGA" rev 0x00
pci1 at ppb0 bus 1
vga1 at pci1 dev 0 function 0 vendor "ATI", unknown product 0x9505 rev 0x00
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
pcib0 at pci0 dev 2 function 0 "SiS 85C503 System" rev 0x25
pciide0 at pci0 dev 2 function 5 "SiS 5513 EIDE" rev 0x00: 746: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
atapiscsi0 at pciide0 channel 0 drive 1
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: ATAPI 5/cdrom removable
cd0(pciide0:0:1): using PIO mode 4, DMA mode 2
wd0 at pciide0 channel 1 drive 0:
wd0: 16-sector PIO, LBA48, 238474MB, 488395055 sectors
wd1 at pciide
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
> Does it do anything for 2860? I have that as an AP now and every once in
> a while it stops working, I need to restart the interface.

No, the driver code is a completely different C file. It's possible
there are analogous bugs for 2860 though, since the hardware and driver
are both closely related to 2661.

 - R.
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
On 2009/11/30 20:33, viq wrote:
> On Mon, Nov 30, 2009 at 10:56:29AM -0800, Roland Dreier wrote:
> > Hi Damien / OpenBSD devs,
> >
> > Did anyone get a chance to look at this diff? These fixes are the
> > difference for me between ral being usable as an AP and getting stuck
> > almost immediately under heavy load. Is there anything I need to do
> > to get this committed?
> >
> > Thanks,
> > Roland
>
> Does it do anything for 2860? I have that as an AP now and every once in
> a while it stops working, I need to restart the interface.

It doesn't, it is separate code.
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
On Mon, Nov 30, 2009 at 10:56:29AM -0800, Roland Dreier wrote:
> Hi Damien / OpenBSD devs,
>
> Did anyone get a chance to look at this diff? These fixes are the
> difference for me between ral being usable as an AP and getting stuck
> almost immediately under heavy load. Is there anything I need to do
> to get this committed?
>
> Thanks,
> Roland

Does it do anything for 2860? I have that as an AP now and every once in
a while it stops working, I need to restart the interface.

--
viq
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
Hi Damien / OpenBSD devs,

Did anyone get a chance to look at this diff? These fixes are the
difference for me between ral being usable as an AP and getting stuck
almost immediately under heavy load. Is there anything I need to do
to get this committed?

Thanks,
  Roland
strange multicast send bug with ral(4) (was: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load)
By the way, I forgot to mention that even with this patch applied, I
do have one odd problem with ral on my system -- after some time
(hours it appears), the ral interface stops being able to send
multicasts/broadcasts. All other traffic works fine, including
receiving multicasts, but no multicasts go out. I only noticed this
because my access point is running rtadvd for IPv6, and the clients
stop receiving route advertisements.

Bouncing the interface with "ifconfig ral0 down; ifconfig ral0 up"
fixes this for a few more hours.

I've not gotten very far debugging this yet, so I don't even know yet
if the multicasts are making it to the driver or are getting lost
higher in the stack. But maybe someone has seen this and has some
idea of what's going on?

(FWIW, this system is still running OpenBSD 4.5 with my patch applied,
so possibly this is even something that was already fixed)

Thanks,
  Roland
[PATCH] Fix interrupt handling in ral(4) for RT2661 under load
The interrupt handling in ral(4) for RT2661 has a couple of problems,
which causes the interface to get stuck under heavy load with OACTIVE
set (the problems are likely especially severe on slow systems such as
my 600MHz VIA system); bouncing the interface down and back up fixes
things. As I describe below, I think I've been able to fix it, and
I'd be happy to see the patch below reviewed and applied.

I've seen other reports that look similar to the problems I was
having; eg bug kernel/5958 starts out talking about RT2860 (which is
completely different code) but some of the "me too" replies are for
RT2561S, which I hope this patch fixes (I've cc'ed those reporters;
test reports welcome!). I've not looked at the RT2860 code due to
lack of hardware, but if someone wants to send me a PCI card

The first problem is that multiple TX completions may happen before
the interrupt handler gets to rt2661_tx_intr(). When this happens,
the TX interrupt handler only completes one entry in the TX ring,
which leads to the driver getting behind the hardware. To fix this, I
extended the qid field in the TX descriptor to contain the index in
the TX ring as well as the queue ID, and then when an interrupt is
missed, free the earlier TX entries as well as the entry that the
interrupt is for. (I did see this code trigger under load)

This exposes the second problem: there is a race that is inherent in
separating TX completion handling between TX DMA interrupts and TX
interrupts -- the driver may handle all the TX DMAs that finished when
it called rt2661_tx_dma_intr(), but by the time it gets to
rt2661_tx_intr(), another TX may have completed and the driver may end
up processing a TX completion for which it hasn't handled the TX DMA
completion. This ends up leaking mbufs if a new send is enqueued
before the TX DMA interrupt has a chance to "catch up." (This happens
in practice on my system as well)

It is probably possible to fix this and keep the split DMA/TX
handling, but that seems to require unneeded complexity. Instead, we
can just ignore TX DMA interrupts and handle everything when the TX
actually completes. This means we don't free the mbuf quite as soon,
but since we can't reuse the slot in the TX ring anyway, I don't see
this as a problem in practice.

With this patch applied, the ral interface on my access point is able
to continue operating under load that would cause the interface to get
stuck with the stock driver fairly quickly.

---
 rt2661.c    |  118 --
 rt2661reg.h |    3 +-
 rt2661var.h |    1 -
 3 files changed, 51 insertions(+), 71 deletions(-)

diff --git a/rt2661.c b/rt2661.c
index f838969..9a9cc53 100644
--- a/rt2661.c
+++ b/rt2661.c
@@ -97,9 +97,8 @@ void rt2661_newassoc(struct ieee80211com *, struct ieee80211_node *,
 int		rt2661_newstate(struct ieee80211com *,
 		    enum ieee80211_state, int);
 uint16_t	rt2661_eeprom_read(struct rt2661_softc *, uint8_t);
+void		rt2661_free_tx_desc(struct rt2661_softc *, struct rt2661_tx_ring *);
 void		rt2661_tx_intr(struct rt2661_softc *);
-void		rt2661_tx_dma_intr(struct rt2661_softc *,
-		    struct rt2661_tx_ring *);
 void		rt2661_rx_intr(struct rt2661_softc *);
 #ifndef IEEE80211_STA_ONLY
 void		rt2661_mcu_beacon_expire(struct rt2661_softc *);
@@ -115,7 +114,7 @@ uint16_t	rt2661_txtime(int, int, uint32_t);
 uint8_t		rt2661_plcp_signal(int);
 void		rt2661_setup_tx_desc(struct rt2661_softc *,
 		    struct rt2661_tx_desc *, uint32_t, uint16_t, int, int,
-		    const bus_dma_segment_t *, int, int);
+		    const bus_dma_segment_t *, int, int, int);
 int		rt2661_tx_mgt(struct rt2661_softc *, struct mbuf *,
 		    struct ieee80211_node *);
 int		rt2661_tx_data(struct rt2661_softc *, struct mbuf *,
@@ -376,7 +375,7 @@ rt2661_alloc_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring,
 
 	ring->count = count;
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 
 	error = bus_dmamap_create(sc->sc_dmat, count * RT2661_TX_DESC_SIZE, 1,
 	    count * RT2661_TX_DESC_SIZE, 0, BUS_DMA_NOWAIT, &ring->map);
@@ -470,7 +469,7 @@ rt2661_reset_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring)
 	    BUS_DMASYNC_PREWRITE);
 
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 }
 
 void
@@ -881,6 +880,36 @@ rt2661_eeprom_read(struct rt2661_softc *sc, uint8_t addr)
 }
 
 void
+rt2661_free_tx_desc(struct rt2661_softc *sc, struct rt2661_tx_ring *txq)
+{
+	struct rt2661_tx_desc *desc = &txq->desc[txq->stat];
+	struct rt2661_tx_data *data = &txq->data[txq->stat];
+	struct ieee80211com *ic = &sc->sc_ic;
+
+	bus_dmamap_sync(sc->sc_dmat, data->map,