Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On 11/23/2009 05:00 AM, Francesco Gringoli wrote: > > so you can observe this behavior at your site. Do you mind describing > the exact configuration? Maybe this time I can reproduce this behavior, > as I tried everything to make it happen. I also asked Larry one of his > boards and put it into several PCs but had no chance to reproduce the > crash. Could you please also report the neighboring stations, the AP you > are connected and so on. As Michael said, I was the one that reported the behavior. My card was a BCM4318 in a Cardbus format running V5.2 of the openfwwf. The AP is a Linksys WRT54G V5 running standard firmware v1.02.6. A list of nearby AP's with channels and strengths are as follows: Cell 01 - Address: 00:14:BF:85:49:FA Channel:1 Frequency:2.412 GHz (Channel 1) Quality=70/70 Signal level=-6 dBm Encryption key:on (WPA2) ESSID:"lwfdjf_rad" Cell 02 - Address: 00:1B:11:5C:B0:83 Channel:1 Frequency:2.412 GHz (Channel 1) Quality=25/70 Signal level=-85 dBm Encryption key:on (WEP) ESSID:"Browns" Cell 03 - Address: 00:1A:70:46:BA:B1 Channel:11 Frequency:2.462 GHz (Channel 11) Quality=42/70 Signal level=-68 dBm Encryption key:on(WEP) ESSID:"Larry with space" Cell 04 - Address: 00:23:69:81:B7:D9 Channel:6 Frequency:2.437 GHz (Channel 6) Quality=34/70 Signal level=-76 dBm Encryption key:on(WEP) ESSID:"Hoover" Cell 05 - Address: 00:14:BF:0C:7E:14 Channel:6 Frequency:2.437 GHz (Channel 6) Quality=28/70 Signal level=-82 dBm Encryption key:on (WPA) ESSID:"linksys" Cell 06 - Address: 00:22:6B:78:18:7D Channel:11 Frequency:2.462 GHz (Channel 11) Quality=26/70 Signal level=-84 dBm Encryption key:on (WPA) ESSID:"Go Away" I was connected to the AP in Cell 01. The test was my usual with a repeating tcpperf run in one terminal and a flood ping to the same server in a second. Please let me know if I missed any useful information. This condition may take a long time to show up. For instance, my latest test ran for 25 hours before failure. All other tests have failed much more quickly, but one never knows. I have not seen this failure with standard firmware. Larry Larry ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Nov 23, 2009, at 11:49 AM, Michael Buesch wrote: > On Monday 23 November 2009 05:45:47 Larry Finger wrote: >> On 11/19/2009 03:24 PM, Michael Buesch wrote: >>> This rewrites the error handling policies in the TX status handler. >>> It tries to be error-tolerant as in "try hard to not crash the >>> machine". >>> It won't recover from errors (that are bugs in the firmware or >>> driver), >>> because that's impossible. However, it will return a more or less >>> useful >>> error message and bail out. It also tries hard to use rate-limited >>> messages >>> to not flood the syslog in case of a failure. >> >> This patch definitely helped open-source firmware, but it is not a >> complete fix. > > It is no fix _at_ _all_. > The patch does not change a single line of code that wasn't either > an assertion > or a machine crash before. > So it just transforms assertions into more verbose assertions and > crashes into > assertions without a crash. > >> debug: Out of order TX status report on DMA ring 1. Expected 114, >> but got 146 > > Ok, this is what I expected. > > Let's see what's going on. Here's the ring. o is unused, * is used. > > ooo > ***ooo > ^ ^ ^ > 114 146 newest > oldest > > So as you can see, the firmware reported a TX status for a frame > right in the middle of > the ringbuffer. The new code detects this now before getting a > double free and/or silent > memory corruption (freeing of used memory). Hi Michael, so you can observe this behavior at your site. Do you mind describing the exact configuration? Maybe this time I can reproduce this behavior, as I tried everything to make it happen. I also asked Larry one of his boards and put it into several PCs but had no chance to reproduce the crash. Could you please also report the neighboring stations, the AP you are connected and so on. Many thanks, -Francesco Informativa sulla privacy: http://help.ing.unibs.it/privacy.php ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Monday 23 November 2009 12:00:15 Francesco Gringoli wrote: > Hi Michael, > > so you can observe this behavior at your site. Do you mind describing > the exact configuration? Maybe this time I can reproduce this > behavior, as I tried everything to make it happen. I also asked Larry > one of his boards and put it into several PCs but had no chance to > reproduce the crash. Could you please also report the neighboring > stations, the AP you are connected and so on. I don't reproduce this, because I never really tried running the opensource firmware. Larry reproduced it, this time. -- Greetings, Michael. ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Monday 23 November 2009 05:45:47 Larry Finger wrote: > On 11/19/2009 03:24 PM, Michael Buesch wrote: > > This rewrites the error handling policies in the TX status handler. > > It tries to be error-tolerant as in "try hard to not crash the machine". > > It won't recover from errors (that are bugs in the firmware or driver), > > because that's impossible. However, it will return a more or less useful > > error message and bail out. It also tries hard to use rate-limited messages > > to not flood the syslog in case of a failure. > > This patch definitely helped open-source firmware, but it is not a complete > fix. It is no fix _at_ _all_. The patch does not change a single line of code that wasn't either an assertion or a machine crash before. So it just transforms assertions into more verbose assertions and crashes into assertions without a crash. > debug: Out of order TX status report on DMA ring 1. Expected 114, but got 146 Ok, this is what I expected. Let's see what's going on. Here's the ring. o is unused, * is used. ooo***ooo ^ ^ ^ 114 146 newest oldest So as you can see, the firmware reported a TX status for a frame right in the middle of the ringbuffer. The new code detects this now before getting a double free and/or silent memory corruption (freeing of used memory). It really is illegal to report a TX status for a frame that's not the oldest one in the ring. The firmware is required to process all frames in-order on one ring. So how can this failure happen? I think there basically are three ways this can happen. - First is that the ordering within one ring really gets messed up and it loses track of its ring pointers. I'm not sure if this is likely. Probably not. - It messes up the ring membership. So it reports a TX status on the wrong ring. Note that the "ring" kernel pointer in the TX status report handler is derived from the cookie (and so also the number in the message "Out of order TX status report on DMA ring 1" is derived from the cookie). So it's untrustworthy in case of broken firmware. The firmware has QoS-alike mechanisms, even if QoS is disabled. Maybe these mechanisms are broken. - Third is the possibility of a driver bug. I rule that out as long as nobody is able to reproduce it with proprietary firmware. -- Greetings, Michael. ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Monday 23 November 2009 02:34:09 Larry Finger wrote: > On 11/19/2009 03:24 PM, Michael Buesch wrote: > > This rewrites the error handling policies in the TX status handler. > > It tries to be error-tolerant as in "try hard to not crash the machine". > > It won't recover from errors (that are bugs in the firmware or driver), > > because that's impossible. However, it will return a more or less useful > > error message and bail out. It also tries hard to use rate-limited messages > > to not flood the syslog in case of a failure. > > > > Signed-off-by: Michael Buesch > > > > --- > > Tested and ACKed by Larry Finger. Not only does this improve the error > handling > for b43, but it also appears to fix the skb == NULL error that I experienced > with the open-source firmware. I don't think there's any way it can fix this. The patch doesn't change the code behavior. It just changes the sanity checks, that under normal circumstances should never trigger. > John - please push this into wireless-testing. It should also go to 2.6.32, > but > it is likely too large for the current stage. At least Cc it to stable. Don't put it into stable. This is not a fix. I don't think it's suitable for 2.6.32 at this stage, too. -- Greetings, Michael. ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On 11/19/2009 03:24 PM, Michael Buesch wrote: > This rewrites the error handling policies in the TX status handler. > It tries to be error-tolerant as in "try hard to not crash the machine". > It won't recover from errors (that are bugs in the firmware or driver), > because that's impossible. However, it will return a more or less useful > error message and bail out. It also tries hard to use rate-limited messages > to not flood the syslog in case of a failure. This patch definitely helped open-source firmware, but it is not a complete fix. Usually, a failure occurs within minutes to a few hours under heavy load. For my system, the load is repeating tcpperf transmits in one terminal, and a flood ping in a second. This time, the system ran for ~25 hours before failing. After removing the header to eliminate wrap-around, the messages logged are: b43 ssb0:0: firmware: requesting b43-open/pcm5.fw Loading OpenSource firmware version 410.31754 Hardware crypto acceleration not supported by firmware QoS not supported by firmware debug: Chip initialized debug: 32-bit DMA initialized debug: QoS disabled debug: Wireless interface started debug: Adding Interface type 2 wlan0: direct probe to AP 00:14:bf:85:49:fa (try 1) wlan0: direct probe responded wlan0: authenticate with AP 00:14:bf:85:49:fa (try 1) wlan0: authenticated wlan0: associate with AP 00:14:bf:85:49:fa (try 1) wlan0: RX ReassocResp from 00:14:bf:85:49:fa (capab=0x411 status=12 aid=0) wlan0: AP denied association (code=12) wlan0: associate with AP 00:14:bf:85:49:fa (try 1) wlan0: RX ReassocResp from 00:14:bf:85:49:fa (capab=0x411 status=12 aid=0) wlan0: AP denied association (code=12) wlan0: deauthenticating from 00:14:bf:85:49:fa by local choice (reason=3) wlan0: direct probe to AP 00:14:bf:85:49:fa (try 1) wlan0: direct probe to AP 00:14:bf:85:49:fa (try 1) wlan0: direct probe responded wlan0: authenticate with AP 00:14:bf:85:49:fa (try 1) wlan0: authenticated wlan0: associate with AP 00:14:bf:85:49:fa (try 1) wlan0: RX ReassocResp from 00:14:bf:85:49:fa (capab=0x411 status=0 aid=2) wlan0: associated spurious 8259A interrupt: IRQ15. SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=00:18:39:5e:90:f9:00:14:bf:85:49:f8:08:00 SRC=192.168.1.1 DST=192.168.1.107 LEN=336 TOS=0x00 PREC=0x00 TTL=64 ID=64124 PROTO=UDP SPT=67 DPT=68 LEN=316 SFW2-INext-DROP-DEFLT IN=wlan0 OUT= MAC=00:18:39:5e:90:f9:00:14:bf:85:49:f8:08:00 SRC=192.168.1.1 DST=192.168.1.107 LEN=336 TOS=0x00 PREC=0x00 TTL=64 ID=6664 PROTO=UDP SPT=67 DPT=68 LEN=316 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 146 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 148 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 118 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 150 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 152 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 154 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 156 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 158 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 160 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 162 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 164 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 134 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 166 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 168 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 170 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 172 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 174 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 176 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 178 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 180 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 182 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 184 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 154 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 186 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 156 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 188 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 158 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 190 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 160 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 192 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 162 debug: Out of order TX status report on DMA ring 1. Expected 114, but got 19
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On 11/19/2009 03:24 PM, Michael Buesch wrote: > This rewrites the error handling policies in the TX status handler. > It tries to be error-tolerant as in "try hard to not crash the machine". > It won't recover from errors (that are bugs in the firmware or driver), > because that's impossible. However, it will return a more or less useful > error message and bail out. It also tries hard to use rate-limited messages > to not flood the syslog in case of a failure. > > Signed-off-by: Michael Buesch > > --- Tested and ACKed by Larry Finger. Not only does this improve the error handling for b43, but it also appears to fix the skb == NULL error that I experienced with the open-source firmware. John - please push this into wireless-testing. It should also go to 2.6.32, but it is likely too large for the current stage. At least Cc it to stable. Larry ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Sunday 22 November 2009 19:11:52 Larry Finger wrote: > On 11/22/2009 11:52 AM, Michael Buesch wrote: > > On Thursday 19 November 2009 22:24:29 Michael Buesch wrote: > >> This rewrites the error handling policies in the TX status handler. > >> It tries to be error-tolerant as in "try hard to not crash the machine". > >> It won't recover from errors (that are bugs in the firmware or driver), > >> because that's impossible. However, it will return a more or less useful > >> error message and bail out. It also tries hard to use rate-limited messages > >> to not flood the syslog in case of a failure. > >> > >> Signed-off-by: Michael Buesch > > > > So did somebody try this with opensource firmware, yet? > > I'm testing now. So far, it has survived about 18 hours running tcpperf in one > console, and a flood ping in another. Cool. Thanks for testing. I'd have expected it to blow up, though. It's a little bit strange, because there still are reports of blowing up opensource firmware. This patch should produce better error messages in that case (it will not fix the blown firmware). > It looks really good, but I want at least > 24 hours before committing. Well, no. Just commit it, please. If this breaks, the _firmware_ has to be fixed. Not the patch. -- Greetings, Michael. ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On 11/22/2009 11:52 AM, Michael Buesch wrote: > On Thursday 19 November 2009 22:24:29 Michael Buesch wrote: >> This rewrites the error handling policies in the TX status handler. >> It tries to be error-tolerant as in "try hard to not crash the machine". >> It won't recover from errors (that are bugs in the firmware or driver), >> because that's impossible. However, it will return a more or less useful >> error message and bail out. It also tries hard to use rate-limited messages >> to not flood the syslog in case of a failure. >> >> Signed-off-by: Michael Buesch > > So did somebody try this with opensource firmware, yet? I'm testing now. So far, it has survived about 18 hours running tcpperf in one console, and a flood ping in another. It looks really good, but I want at least 24 hours before committing. Larry ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
Re: [PATCH] b43: Rewrite DMA Tx status handling sanity checks
On Thursday 19 November 2009 22:24:29 Michael Buesch wrote: > This rewrites the error handling policies in the TX status handler. > It tries to be error-tolerant as in "try hard to not crash the machine". > It won't recover from errors (that are bugs in the firmware or driver), > because that's impossible. However, it will return a more or less useful > error message and bail out. It also tries hard to use rate-limited messages > to not flood the syslog in case of a failure. > > Signed-off-by: Michael Buesch So did somebody try this with opensource firmware, yet? -- Greetings, Michael. ___ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev
[PATCH] b43: Rewrite DMA Tx status handling sanity checks
This rewrites the error handling policies in the TX status handler. It tries to be error-tolerant as in "try hard to not crash the machine". It won't recover from errors (that are bugs in the firmware or driver), because that's impossible. However, it will return a more or less useful error message and bail out. It also tries hard to use rate-limited messages to not flood the syslog in case of a failure. Signed-off-by: Michael Buesch --- This patch also adds the skb pointer poisoning. I think it's not strictly needed to catch firmware bugs anymore, because those should be caught by the in-order check. However, we love defensive coding, so we try to make the code as robust as possible. I did not try with openfirmware, but I guess it blows up in the in-order check there pretty quickly. Would be cool if somebody could stress this on openfirmware. Note that if the ordering sanity check fails that can mean three things: - Either the report ordering on one engine is wrong. - We missed one report on the engine. - Or we reported the status to the wrong engine. Index: wireless-testing/drivers/net/wireless/b43/dma.c === --- wireless-testing.orig/drivers/net/wireless/b43/dma.c2009-11-18 19:21:17.0 +0100 +++ wireless-testing/drivers/net/wireless/b43/dma.c 2009-11-19 21:40:45.0 +0100 @@ -874,7 +874,7 @@ static void free_all_descbuffers(struct for (i = 0; i < ring->nr_slots; i++) { desc = ring->ops->idx2desc(ring, i, &meta); - if (!meta->skb) { + if (!meta->skb || b43_dma_ptr_is_poisoned(meta->skb)) { B43_WARN_ON(!ring->tx); continue; } @@ -926,7 +926,7 @@ struct b43_dmaring *b43_setup_dmaring(st enum b43_dmatype type) { struct b43_dmaring *ring; - int err; + int i, err; dma_addr_t dma_test; ring = kzalloc(sizeof(*ring), GFP_KERNEL); @@ -941,6 +941,8 @@ struct b43_dmaring *b43_setup_dmaring(st GFP_KERNEL); if (!ring->meta) goto err_kfree_ring; + for (i = 0; i < ring->nr_slots; i++) + ring->meta->skb = B43_DMA_PTR_POISON; ring->type = type; ring->dev = dev; @@ -1251,11 +1253,13 @@ struct b43_dmaring *parse_cookie(struct case 0x5000: ring = dma->tx_ring_mcast; break; - default: - B43_WARN_ON(1); } *slot = (cookie & 0x0FFF); - B43_WARN_ON(!(ring && *slot >= 0 && *slot < ring->nr_slots)); + if (unlikely(!ring || *slot < 0 || *slot >= ring->nr_slots)) { + b43dbg(dev->wl, "TX-status contains " + "invalid cookie: 0x%04X\n", cookie); + return NULL; + } return ring; } @@ -1494,19 +1498,40 @@ void b43_dma_handle_txstatus(struct b43_ struct b43_dmaring *ring; struct b43_dmadesc_generic *desc; struct b43_dmadesc_meta *meta; - int slot; + int slot, firstused; bool frame_succeed; ring = parse_cookie(dev, status->cookie, &slot); if (unlikely(!ring)) return; - B43_WARN_ON(!ring->tx); + + /* Sanity check: TX packets are processed in-order on one ring. +* Check if the slot deduced from the cookie really is the first +* used slot. */ + firstused = ring->current_slot - ring->used_slots + 1; + if (firstused < 0) + firstused = ring->nr_slots + firstused; + if (unlikely(slot != firstused)) { + /* This possibly is a firmware bug and will result in +* malfunction, memory leaks and/or stall of DMA functionality. */ + b43dbg(dev->wl, "Out of order TX status report on DMA ring %d. " + "Expected %d, but got %d\n", + ring->index, firstused, slot); + return; + } + ops = ring->ops; while (1) { - B43_WARN_ON(!(slot >= 0 && slot < ring->nr_slots)); + B43_WARN_ON(slot < 0 || slot >= ring->nr_slots); desc = ops->idx2desc(ring, slot, &meta); + if (b43_dma_ptr_is_poisoned(meta->skb)) { + b43dbg(dev->wl, "Poisoned TX slot %d (first=%d) " + "on ring %d\n", + slot, firstused, ring->index); + break; + } if (meta->skb) { struct b43_private_tx_info *priv_info = b43_get_priv_tx_info(IEEE80211_SKB_CB(meta->skb)); @@ -1522,7 +1547,14 @@ void b43_dma_handle_txstatus(struct b43_ if (meta->is_last_fragment) { struct ieee80211_tx_info *info; - BUG_ON(!meta->skb); +