Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-10-05 Thread Martin Pauly
Alan,
>> Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68
>   If you can get a core dump, and do 'bt' in gdb, and also do 'print
> *p' at the point of the assertion, that would help a lot.
>
>   But my main suspect right now is bad memory.  The code hasn't
> changed in a long time, and I doubt you're doing anything really weird
> to the server.
well, I'm trying hard not to confuse my dear servers :-)

Funnily enough, following advice from our LDAP admin,
I changed the order of the LDAP modules in radiusd.conf
on one machine from
Auth-Type LDAP {
  redundant {
    ldap1
    ldap2
    ldap3
  }
}
to
Auth-Type LDAP {
  redundant {
    ldap3
    ldap1
  }
}
I.e. I avoided our most loaded LDAP server.
I also enabled coredumps and ran in full debug mode all Friday.
Guess what? No crashes over the long weekend (we had a holiday on Monday).
Given the erratic behavior, I will indeed give the hardware a closer look.

Thanks so far
Martin
-- 
  Dr. Martin Pauly      Fax:    49-6421-28-26994
  HRZ Univ. Marburg     Phone:  49-6421-28-23527
  Hans-Meerwein-Str.    E-Mail: [EMAIL PROTECTED]
  D-35032 Marburg
- 
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
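
The reordered redundant block in the message above relies on failover: the listed modules are tried in order, and the next one is consulted only when the previous one fails. A minimal Python sketch of that idea follows. It is an illustration only, not FreeRADIUS code; the function name first_working and the two stand-in backends are hypothetical.

```python
def first_working(backends, request):
    """Try each (name, lookup) pair in order; return the first successful answer."""
    for name, lookup in backends:
        try:
            return name, lookup(request)
        except ConnectionError:
            # This backend failed outright; fall through to the next one.
            continue
    raise ConnectionError("all backends failed")


def down(request):
    # Hypothetical backend that is unreachable.
    raise ConnectionError("backend unreachable")


def up(request):
    # Hypothetical backend that answers immediately.
    return "accept:" + request


if __name__ == "__main__":
    # Mirrors the reordered list: ldap3 first, then ldap1.
    print(first_working([("ldap3", down), ("ldap1", up)], "bob"))
```

In this sketch a backend that hangs instead of failing never raises, so the loop never moves on. Ordering alone therefore would not protect against a server that is slow rather than down.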


Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-30 Thread Alan DeKok
Dr. Martin Pauly [EMAIL PROTECTED] wrote:
> we are crashing every couple of hours or so now, but at least this time
> got something in the log:
>
> Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68
>
> Looks like there might be some more bug-squashing ahead? :-))
> I will try to run in debug mode tomorrow so we can get some more
> information on the problems (at least, they seem fairly reproducible).

  If you can get a core dump, and do 'bt' in gdb, and also do 'print
*p' at the point of the assertion, that would help a lot.

  But my main suspect right now is bad memory.  The code hasn't
changed in a long time, and I doubt you're doing anything really weird
to the server.

  Alan DeKok.



Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-29 Thread Martin Pauly
>   Yes.  If all of the threads are blocked forever, waiting for the DB
> to return data, then the queue of requests grows without bounds.  At
> some point, the server says "I'm not making progress, and I can't
> recover from this", and kills itself.
hm, I thought the timeout values were for this, but I now understand
that an LDAP communication might get stuck halfway, thus _not_ 
triggering a timeout event.

>   Since the server is *already* effectively dead at that point, it
> makes no difference to your network.
>
>   The solution is to fix the database so that it doesn't kill the
> server.
well, we should perhaps be able to wait for a database to go away and
come back again after a minute without crashing the daemon.

Anyway, I'm now going with an increased ldap_connections_number
(100 instead of 5), and increased LDAP timeouts as well.
What about max_request_time and delete_blocked_requests -- isn't this
exactly what is needed to protect the server from being blocked?

Cheers, Martin



Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-29 Thread Alan DeKok
Martin Pauly [EMAIL PROTECTED] wrote:
> What about max_request_time and delete_blocked_requests -- isn't this
> exactly what is needed to protect the server from being blocked?

  Yes, but the server doesn't deal well with blocked threads.  The
delete_blocked_requests option doesn't really work.

  We hope to fix this in the next major version of the server.

  Alan DeKok.
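
The exchange above turns on the difference between giving up on a request and actually freeing the thread that is stuck on it. A small Python sketch of that distinction follows, assuming a pool with a single worker and a backend call that never returns. It is not how FreeRADIUS implements max_request_time or delete_blocked_requests; the names call_with_deadline, stuck_lookup and demo are hypothetical.

```python
import concurrent.futures
import threading


def call_with_deadline(pool, fn, deadline_s):
    """Run fn in the pool; return (result, True), or (None, False) on deadline."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=deadline_s), True
    except concurrent.futures.TimeoutError:
        # The caller stops waiting, but the worker thread is still inside fn.
        return None, False


def demo():
    backend_answers = threading.Event()  # never set while the "backend" hangs

    def stuck_lookup():
        backend_answers.wait()  # stands in for a backend call that never returns
        return "ok"

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    _, finished = call_with_deadline(pool, stuck_lookup, 0.1)

    # The only worker is still blocked, so a trivial second job cannot start.
    second = pool.submit(lambda: "fast")
    try:
        second.result(timeout=0.1)
        second_ran = True
    except concurrent.futures.TimeoutError:
        second_ran = False

    backend_answers.set()  # release the worker so the program can exit
    pool.shutdown(wait=True)
    return finished, second_ran


if __name__ == "__main__":
    print(demo())
```

In the sketch the deadline lets the caller move on, yet the worker stays occupied until the backend answers, so the pool's capacity is not recovered by abandoning the request.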


Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-29 Thread Dr. Martin Pauly
hi,

we are crashing every couple of hours or so now, but at least this time
got something in the log:

Thu Sep 29 20:33:19 2005 : Error: Assertion failed in modcall.c, line 68

Looks like there might be some more bug-squashing ahead? :-))
I will try to run in debug mode tomorrow so we can get some more
information on the problems (at least, they seem fairly reproducible).

Martin





Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-28 Thread Martin Pauly
Hi,

we seem to have a stability issue with freeradius 1.0.4/1.0.5:
1.0.4 crashed in quick succession on both of my redundant servers
during my vacation -- not much of a trace in the logfiles.

On Monday, I upgraded to 1.0.5 with everything looking fine for
almost 2 days. Yesterday, we started polling the servers regularly 
from a NAGIOS system, using the check_rad NAGIOS plugin.

One server (the one processing the highest number of requests)
crashed twice yesterday; this time it complained about
Unresponsive child processes in close temporal correlation.

We do have performance problems with our LDAP backend,
so this sounds reasonable, but could this cause the server to crash?

During testing, I also encountered a situation where the freeradius
process lived on, but became completely unresponsive; I had to kill -9 it.

What should I do to track down these issues? Does running in full debug
mode for days make sense?

Thanks, Martin



Re: Crashes with 1.0.4/1.0.5, perhaps connected with slow LDAP backend?

2005-09-28 Thread Alan DeKok
Martin Pauly [EMAIL PROTECTED] wrote:
> We do have performance problems with our LDAP backend,
> so this sounds reasonable, but could this cause the server to crash?

  Yes.  If all of the threads are blocked forever, waiting for the DB
to return data, then the queue of requests grows without bounds.  At
some point, the server says "I'm not making progress, and I can't
recover from this", and kills itself.

  Since the server is *already* effectively dead at that point, it
makes no difference to your network.

  The solution is to fix the database so that it doesn't kill the
server.

  Alan DeKok.
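
The queue growth described above can be made concrete with a toy model. The following Python sketch assumes a fixed pool of workers, a constant arrival rate, and a backend that either answers within one tick or never answers at all. The function name backlog_over_time and all numbers are illustrative assumptions, not measurements from a real server.

```python
import math


def backlog_over_time(workers, arrivals_per_tick, service_ticks, ticks):
    """Return the number of queued requests after each tick."""
    busy = []      # remaining service time of each request being handled
    queued = 0     # requests waiting for a free worker
    history = []
    for _ in range(ticks):
        queued += arrivals_per_tick
        # Requests whose service time has run out free their worker.
        busy = [left - 1 for left in busy if left - 1 > 0]
        # Hand waiting requests to any free workers.
        while queued > 0 and len(busy) < workers:
            busy.append(service_ticks)
            queued -= 1
        history.append(queued)
    return history


if __name__ == "__main__":
    # Backend answers within one tick: the queue stays empty.
    print(backlog_over_time(5, 2, 1, 10))
    # Backend never answers: once all 5 workers are stuck, the queue only grows.
    print(backlog_over_time(5, 2, math.inf, 10))
```

In the second run the backlog rises by the full arrival rate every tick once the last worker blocks, which is the "not making progress" state described in the reply.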
