Good day, I'm trying to figure out why my servers continue to be marked zombie, even though they continue to handle traffic. There appears to be no impact, just seemingly erroneous - or at least unexplained - log entries.
I have three 2.1.8 servers that feed accounting to a fourth server (via copy_acct_to_homeserver), which runs 1.1.7. The primary servers also send some auth (via proxy) and lots of acct (some via proxy, but also via copy_acct_to_homeserver) to a pair of Cisco ACS servers. The radius.log file on the primary servers shows that they mark the fourth server and the Cisco (upstream) servers as zombie quite regularly (but not simultaneously); thankfully, they never get marked dead. All of these servers are attached to the same Ethernet switch, with a slight detour through a router that does VLAN routing between them. The Cisco servers also proxy to various other servers outside my network.
I have debug output from one of the servers that I have studied at length; I recorded the debug for 6 minutes or so before one of the upstream servers was marked zombie. This is from a production machine with a fair amount of traffic, so the debug file is about 9 MB. I'm not sure whether it would be appropriate to post it to the mailing list, but I'd be happy to. Do you want specific excerpts, or the whole thing?
I've set the response_window as high as 60 seconds in the clients.conf file, and I keep the zombie_period at 20 seconds. I've also turned off the status_check feature, as 1.1.7 and Cisco ACS do not appear to support it.
The clients.conf file says that if "a" response is not received by the time the response_window is up (so 60 seconds), the server is marked zombie. Based on what I see, I'm interpreting this to mean that if a single response is not seen within 60 seconds, the server is marked zombie, even if hundreds of other responses were successfully sent and received during those 60 seconds. At that point the zombie_period kicks in, and the moment "any" successful response is received the server is marked as completely alive again. In my case the zombie_period is canceled immediately (though sadly the log does not seem to show when the zombie state ends).
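To make that reading concrete, here is a minimal Python sketch of the state logic as I understand it. This is only my interpretation of the comments in the config file, not the actual server code; the class and method names (HomeServerModel, send, receive, tick) are made up for illustration, and status_check is assumed to be off.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HomeServerModel:
    """Toy model of one upstream (home) server as seen by a proxying server.

    Hypothetical names throughout; this mirrors my reading of the config
    comments, not FreeRADIUS internals.
    """
    response_window: float = 60.0   # seconds to wait for any single reply
    zombie_period: float = 20.0     # seconds a zombie may stay silent
    state: str = "alive"
    zombie_since: Optional[float] = None
    pending: Dict[str, float] = field(default_factory=dict)  # request id -> send time

    def send(self, req_id: str, now: float) -> None:
        """Record a proxied request that is awaiting a reply."""
        self.pending[req_id] = now

    def receive(self, req_id: str, now: float) -> None:
        """Any reply to a known pending request revives a zombie server."""
        if self.pending.pop(req_id, None) is None:
            return  # reply to an unknown or already expired request: ignored
        if self.state == "zombie":
            self.state = "alive"
            self.zombie_since = None

    def tick(self, now: float) -> None:
        """Periodic check: one overdue request is enough to mark zombie."""
        if self.state == "alive":
            overdue = [r for r, t in self.pending.items()
                       if now - t >= self.response_window]
            if overdue:
                for r in overdue:
                    del self.pending[r]
                self.state = "zombie"
                self.zombie_since = now
        elif self.state == "zombie":
            if now - self.zombie_since >= self.zombie_period:
                self.state = "dead"


if __name__ == "__main__":
    server = HomeServerModel()
    server.send("lost", 0.0)              # this one never gets a reply
    for i in range(1, 60):                # plenty of healthy traffic meanwhile
        server.send(f"ok-{i}", float(i))
        server.receive(f"ok-{i}", i + 0.1)
        server.tick(i + 0.1)
    server.tick(60.0)
    print(server.state)                   # zombie, despite 59 good replies
    server.send("next", 60.5)
    server.receive("next", 60.6)
    print(server.state)                   # alive again on the next good reply
```

If this model is right, a single unanswered request would be enough to produce the log entries I'm seeing, and the zombie state would last well under a second on a busy server.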
What seems odd to me is that occasionally a primary server will mark the same upstream server as zombie multiple times over a handful of seconds, yet the other two primary servers rarely mark upstream servers zombie at around the same time. I cannot find anywhere in the debug output, or in packet captures, where a response went missing for a full minute, so I'm trying to find out what is happening. My upstream servers do not appear taxed or unresponsive. Perhaps there is some sort of malformed response I should be looking for? I also get radutmp errors about a wrong NAS ID, though on brief analysis they don't appear to be related.
Any suggestions to help track this down and eliminate the error messages are greatly appreciated.
-Benjamin