To ALL: We are running six 7240XM controllers in a cluster on 8.3.0.7 code. We are seeing something similar, but not exactly the same. We are also seeing re-bootstrapping, but our “cluster heartbeat” looks fine. What we noticed is that after some type of network event where the APs temporarily lose contact with the controller, a bunch of them will continually re-bootstrap over and over. They keep doing it until they hang and lock up, or until you reboot them. Once you reboot the AP, it is fine until the next event.
Because we can get it to stop by rebooting the AP, we don’t think it is a networking issue but rather a bug in the code. We have APs with hundreds of re-bootstrap events. It is very hard to tell whether they are doing it because it is hidden pretty well and there is no real “indicator”. We first noticed that our downed-AP list would constantly fluctuate by about 3-4 APs, and they always seemed to be different APs. They weren’t; it was just that we had a ton of them doing it. Digging into it, you can see it by looking at “show ap bss-table” and checking the time on the BSS. We had a bunch with really short times, under an hour. If we watched them, you could see the constant re-bootstrapping. If you rebooted the AP (a hard reboot), it would stop. We have an open case with TAC and are pursuing it.

From: The EDUCAUSE Wireless Issues Community Group Listserv [mailto:WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU] On Behalf Of Miller, Keith C
Sent: Friday, December 6, 2019 7:52 PM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
Subject: [WIRELESS-LAN] ArubaOS 8.x cluster disconnects

Hello all,

As many of you know, we’re an Aruba shop and we’re running multiple versions of 8.x in our environment. We are also a Nyansa Voyance customer, and those of you who are also Nyansa customers will probably remember back in October when they changed the default behavior for AP down/reboot events from “No Priority” to “Always P2”. Almost immediately, we began receiving alerts from Voyance about large numbers of APs going down at the same time. After looking at our controllers and other NMS tools, we realized that the APs were not actually going down; rather, the radios on the APs were rebootstrapping.

For those unfamiliar with rebootstrapping, it essentially means that the radios of the AP rebooted, but the AP itself stayed up. This is typically caused by missed heartbeats and/or when an AP reconnects to a controller. In a clustered environment, when a controller fails, an AP should gracefully move to its S-AAC with little to no impact. However, in our case we were seeing APs not gracefully fail over after missing heartbeats, and this was causing the rebootstraps. This impacts clients and our users, so obviously we were very concerned with what we had found.

After opening a case with Aruba TAC, we discovered that the cluster members were disconnecting from each other. You can see if this is happening in your environment by running the “show lc-cluster heartbeat counters” command on one of the MDs in a cluster. You’re looking for the last column, which indicates the last time of disconnect. For us, this has been occurring in multiple environments (8.3, 8.4, and 8.5) at least since we began looking into it back in October.

We’ve sent many logs, traces, and now packet captures to the Aruba TAC team. At the request of TAC, we’ve changed heartbeat thresholds and enabled BCMC optimization on VLAN interfaces even though we have it enabled at the SSID level. While some of these efforts have slowed the frequency of the disconnects, they are still occurring.

So I’m looking to get some feedback from those of you running AOS 8.x in your environments. Are you seeing this problem?

Lastly, if you’re experiencing this issue or you’re just interested in finding out more about the health of your environment, you can also verify whether you have APs that are rebootstrapping with the “show ap debug counters” command.
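One way to spot which APs are actively rebootstrapping, rather than eyeballing the counters by hand, is to capture “show ap debug counters” twice a few minutes apart and diff the two captures; any AP whose counter line changed between snapshots deserves a closer look. Below is a minimal Python sketch of that idea. It assumes you have saved each capture to a plain text file and that each AP occupies a single line beginning with its name (adjust for how your AOS version wraps the output); the file names are just placeholders.

    #!/usr/bin/env python3
    # Rough helper: diff two saved snapshots of "show ap debug counters"
    # taken a few minutes apart and print the per-AP lines that changed.
    # Assumption: one line per AP; table separator lines start with "-".
    import sys

    def load(path):
        # Keep only non-empty lines that are not table separators.
        with open(path) as f:
            return {line.rstrip() for line in f
                    if line.strip() and not line.lstrip().startswith("-")}

    if len(sys.argv) != 3:
        sys.exit("usage: diff_counters.py before.txt after.txt")

    before, after = load(sys.argv[1]), load(sys.argv[2])
    changed = sorted(after - before)
    print(f"{len(changed)} AP line(s) changed between snapshots:")
    for line in changed:
        print(line)

Run it as “python3 diff_counters.py before.txt after.txt”; an AP that shows up in the changed list on every pass is a likely candidate for the re-bootstrap loop described above.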
If you want to isolate a particular AP and gather more information, you can run the “show ap debug system-status ap-name” command. Here’s what it looks like when the AP doesn’t gracefully fail over:

Cluster Failover Information
----------------------------
Date Time Reason (Latest 10)
--------------------------------------
2019-11-25 01:10:20 Delete A-AAC:172.27.xx.xx, cluster enabled=1. fail-over to 172.27.xx.xx, sby status=1

Thanks in advance for any and all feedback.

Regards,

Keith C. Miller
Wireless Architect, ITS Comm. Technologies
University of North Carolina Chapel Hill
O: (919)962-6564 M: (803)464-2397 | keith.mil...@unc.edu<mailto:keith.mil...@unc.edu>

**********
Replies to EDUCAUSE Community Group emails are sent to the entire community list. If you want to reply only to the person who sent the message, copy and paste their email address and forward the email reply. Additional participation and subscription information can be found at https://www.educause.edu/community
**********
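If you want to tally these failover events across saved captures instead of reading them one AP at a time, a small parser over “show ap debug system-status” output works. The Python sketch below keys off the “Cluster Failover Information” heading and the date-stamped lines shown in the example above; it is only a sketch against the output format pasted in this thread, and the file name and section layout are assumptions, so adjust for your AOS version.

    #!/usr/bin/env python3
    # Rough parser: pull the date-stamped entries out of the
    # "Cluster Failover Information" section of a saved
    # "show ap debug system-status" capture.
    # Assumption: entries begin with a YYYY-MM-DD HH:MM:SS timestamp,
    # as in the example output pasted above.
    import re
    import sys

    ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+")

    def failover_entries(path):
        entries, in_section = [], False
        with open(path) as f:
            for line in f:
                line = line.rstrip()
                if "Cluster Failover Information" in line:
                    in_section = True
                elif in_section and ENTRY.match(line):
                    entries.append(line)
                elif in_section and entries:
                    break  # first non-entry line after the entries ends the section
        return entries

    if len(sys.argv) != 2:
        sys.exit("usage: failover_entries.py system-status.txt")

    for entry in failover_entries(sys.argv[1]):
        print(entry)

Counting how many entries an AP accumulates per day (and whether the reasons are all “Delete A-AAC ... fail-over to ...”) gives a quick sense of whether it is failing over cleanly or churning.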