Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Alan,

> > Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68
>
> If you can get a core dump, and do 'bt' in gdb, and also do 'print *p'
> at the point of the assertion, that would help a lot.
>
> But my main suspect right now is bad memory. The code hasn't changed
> in a long time, and I doubt you're doing anything really weird to the
> server.

Well, I'm trying hard not to confuse my dear servers :-)

Funnily enough, following advice from our LDAP admin, I changed the LDAP query directive sequence in radiusd.conf on one machine from

    Auth-Type LDAP {
        redundant {
            ldap1
            ldap2
            ldap3
        }
    }

to

    Auth-Type LDAP {
        redundant {
            ldap3
            ldap1
        }
    }

I.e. I avoided our most loaded LDAP server. I also enabled core dumps and ran in full debug mode all Friday. Guess what? No crashes over the long weekend (we had a holiday on Monday).

Given the erratic behavior, I will indeed give the hardware a closer look.

Thanks so far,
Martin

-- 
Dr. Martin Pauly     Fax:    49-6421-28-26994
HRZ Univ. Marburg    Phone:  49-6421-28-23527
Hans-Meerwein-Str.   E-Mail: [EMAIL PROTECTED]
D-35032 Marburg

- List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Dr. Martin Pauly [EMAIL PROTECTED] wrote:
> we are crashing every couple of hours or so now, but at least this time
> got something in the log:
>
> Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68
>
> Looks like there might be some more bug-squashing ahead? :-))
> I will try to run in debug mode tomorrow so we can get some more
> information on the problems (at least, they seem fairly reproducible).

If you can get a core dump, and do 'bt' in gdb, and also do 'print *p' at the point of the assertion, that would help a lot.

But my main suspect right now is bad memory. The code hasn't changed in a long time, and I doubt you're doing anything really weird to the server.

Alan DeKok.
Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
> Yes. If all of the threads are blocked forever, waiting for the DB to
> return data, then the queue of requests grows without bounds. At some
> point, the server says "I'm not making progress, and I can't recover
> from this", and kills itself.

Hm, I thought the timeout values were for this, but I now understand that an LDAP communication might get stuck halfway, thus _not_ triggering a timeout event.

> Since the server is *already* effectively dead at that point, it makes
> no difference to your network. The solution is to fix the database so
> that it doesn't kill the server.

Well, we should perhaps be able to wait for a database to go away and come back again after a minute without crashing the daemon. Anyway, I'm now going with an increased ldap_connections_number (100 instead of 5), and increased LDAP timeouts as well.

What about max_request_time and delete_blocked_requests -- isn't this exactly what is needed to protect the server from being blocked?

Cheers,
Martin
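For readers following along, the tuning Martin describes would live in the LDAP module section of radiusd.conf. The fragment below is only a sketch: the instance name ldap1 and the value ldap_connections_number = 100 come from the thread, while the server name and the three timeout values are placeholders, because the thread does not say which timeouts were raised or to what.

```
# Sketch of one LDAP module instance (FreeRADIUS 1.0.x style radiusd.conf).
# Hostname and timeout values are illustrative assumptions, not Martin's settings.
modules {
    ldap ldap1 {
        server = "ldap1.example.org"     # placeholder host
        # basedn, filter and bind settings omitted

        ldap_connections_number = 100    # raised from 5, as stated in the thread

        timeout = 10       # seconds to wait for an LDAP query to finish
        timelimit = 8      # seconds the LDAP server may spend on the query
        net_timeout = 3    # seconds to wait for a network-level response
    }
}
```

Larger timeouts make each worker thread wait longer on a slow directory before giving up, so they trade fewer spurious failures for threads that stay occupied longer.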
Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Martin Pauly [EMAIL PROTECTED] wrote:
> What about max_request_time and delete_blocked_requests -- isn't this
> exactly what is needed to protect the server from being blocked?

Yes, but the server doesn't deal well with blocked threads. The delete_blocked_requests doesn't really work. We hope to fix this in the next major version of the server.

Alan DeKok.
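As a point of reference, the two directives under discussion are top-level settings in radiusd.conf. The values shown here are illustrative assumptions, not a recommendation from the thread.

```
# Top-level radiusd.conf settings (FreeRADIUS 1.0.x style); values are illustrative.

# Maximum time, in seconds, the server spends handling a single request.
max_request_time = 30

# Whether the server should try to delete a request that has exceeded
# max_request_time. Per Alan's reply above, this does not really work
# for threads blocked in a backend call, so it is left off in this sketch.
delete_blocked_requests = no
```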
Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Hi,

we are crashing every couple of hours or so now, but at least this time got something in the log:

Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68

Looks like there might be some more bug-squashing ahead? :-))

I will try to run in debug mode tomorrow so we can get some more information on the problems (at least, they seem fairly reproducible).

Martin
Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Hi,

we seem to have a stability issue with freeradius 1.0.4/1.0.5: 1.0.4 crashed in short sequence on both of my redundant servers during my vacation -- not much of a trace in the logfiles. On Monday, I upgraded to 1.0.5, with everything looking fine for almost 2 days.

Yesterday, we started polling the servers regularly from a NAGIOS system, using the check_rad NAGIOS plugin. One server (the one processing the highest number of requests) crashed twice yesterday; this time it complained about Unresponsive child processes in close temporal correlation.

We do have performance problems with our LDAP backend, so this sounds reasonable, but could this cause the server to crash?

During testing, I also encountered a situation where the freeradius process lived on, but became completely unresponsive; I had to kill -9 it.

What should I do to track down these issues? Does running in full debug mode for days make sense?

Thanks,
Martin
Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?
Martin Pauly [EMAIL PROTECTED] wrote:
> We do have performance problems with our LDAP backend, so this sounds
> reasonable, but could this cause the server to crash?

Yes. If all of the threads are blocked forever, waiting for the DB to return data, then the queue of requests grows without bounds. At some point, the server says "I'm not making progress, and I can't recover from this", and kills itself.

Since the server is *already* effectively dead at that point, it makes no difference to your network. The solution is to fix the database so that it doesn't kill the server.

Alan DeKok.
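The threads Alan refers to are the worker threads configured in the thread pool section of radiusd.conf. The sketch below shows the shape of that section; the numbers are illustrative assumptions and were not posted in this thread.

```
# Worker thread pool in radiusd.conf (FreeRADIUS 1.0.x style); values are illustrative.
thread pool {
    start_servers = 5             # threads created at startup
    max_servers = 32              # upper bound on worker threads
    min_spare_servers = 3         # keep at least this many idle threads
    max_spare_servers = 10        # trim idle threads above this count
    max_requests_per_server = 0   # 0 = threads are never recycled
}
```

Once every one of the max_servers threads is stuck waiting on the LDAP backend, no thread is left to pick up new requests, which is the situation Alan describes. Raising max_servers only delays that point; it does not remove the dependency on a responsive backend.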