Re: freeradius 2.1.8 dies Error: ASSERT FAILED event.c[1084]: home->ev != NULL

2010-03-25 Thread Alan DeKok
fab junkmail wrote:
>>  Why is it running out of sockets?  This shouldn't happen.
> 
> Not sure but there is a _lot_ of attempted proxying going on - maybe
> it just went over the system limits like open file limits or
> something? In any case it probably won't be a problem when I implement
> the robust-proxy-accounting.

  Likely, yes.

  If the server is overloaded and unable to proxy packets... who knows
what can happen.  The *intent* is to have it still work, but it's a
poorly tested code path.

  Alan DeKok.
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html


Re: freeradius 2.1.8 dies Error: ASSERT FAILED event.c[1084]: home->ev != NULL

2010-03-25 Thread fab junkmail
Hi Alan,

Thanks for your response.

Alan DeKok wrote:
>  You can configure the proxy to log accounting packets to disk when the
> home server is down.  See raddb/sites-available/robust-proxy-accounting

Ok I will definitely do this then.

>> Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
>> proxying requests.
>
>  Why is it running out of sockets?  This shouldn't happen.

Not sure but there is a _lot_ of attempted proxying going on - maybe
it just went over the system limits like open file limits or
something? In any case it probably won't be a problem when I implement
the robust-proxy-accounting.

>  You have a NAS which is sending large amounts of traffic to a proxy
> when the home server is down.  The proxy isn't configured to do anything
> useful with the packets.  This is a bug in the *architecture*.

Understood.

Thanks for your help Alan.

Regards,
Anthony

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html


Re: freeradius 2.1.8 dies Error: ASSERT FAILED event.c[1084]: home->ev != NULL

2010-03-25 Thread Alan DeKok
fab junkmail wrote:
> I recently upgraded our freeradius servers to 2.1.8 and over the past
> month it has died on one of the servers two times (spaced about two
> weeks apart I think). So fairly infrequently.

  OK.

> A bit of background, We use this server predominantly to proxy
> requests. Every day for about 15 minutes, the two main home servers we
> proxy to stop responding (they are doing backups or maintenance during
> this time) so for those 15 minutes our clients (LNS/NAS) would be
> sending a very large number of accounting interim packets and some
> stop packets and would be resending these while the home servers are
> down.

  You can configure the proxy to log accounting packets to disk when the
home server is down.  See raddb/sites-available/robust-proxy-accounting

> Sun Mar 14 17:30:15 2010 : Proxy: Marking home server 10.0.1.48
> port 1646 as zombie (it looks like it is dead).
> Sun Mar 14 17:30:16 2010 : Proxy: Marking home server 10.0.1.47
> port 1646 as zombie (it looks like it is dead).
> Sun Mar 14 17:30:19 2010 : Proxy: Marking home server 10.0.1.47
> port 1645 as zombie (it looks like it is dead).
> Sun Mar 14 17:30:19 2010 : Error: No response to status check 903535
> for home server 10.0.1.48 port 1646
> Sun Mar 14 17:30:20 2010 : Error: No response to status check 903536
> for home server 10.0.1.47 port 1646
> ...
> Sun Mar 14 17:30:32 2010 : Error: Internal sanity check failed for
> child state

  Hmm... that's not good.

> Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
> proxying requests.

  Why is it running out of sockets?  This shouldn't happen.

> Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
> proxying requests.
> Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
> proxying requests.
> ...
> Fri Mar 19 17:30:56 2010 : Error: ASSERT FAILED event.c[1084]:
> home->ev != NULL

  Well... after all of the previous errors, it's not surprising that
something *worse* eventually goes wrong.  It's like driving your car for
45 minutes after the tires are flat: not a good idea.

> That last one is where it dies I think.

  Yes.

> That one was found to be a bug and was fixed - I don't know if my case
> is a bug though.

  It's a bug, but the other problems you're seeing should be fixed, too.

> I don't currently use the robust proxy accounting that that thread
> suggests. I expect that would probably work around the issue of
> freeradius crashing in this case and I will give that a go.

  Yes.

> Just
> posting this to let you know that it _might_ be a bug and to ask for
> advice about whether you think this is a bug or not, and if I should
> follow up on that, or if you think it is just my configuration that
> needs some changes and what areas I should concentrate on if that is
> the case?

  You have a NAS which is sending large amounts of traffic to a proxy
when the home server is down.  The proxy isn't configured to do anything
useful with the packets.  This is a bug in the *architecture*.

  Alan DeKok.
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html


freeradius 2.1.8 dies Error: ASSERT FAILED event.c[1084]: home->ev != NULL

2010-03-24 Thread fab junkmail
I recently upgraded our freeradius servers to 2.1.8 and over the past
month it has died on one of the servers two times (spaced about two
weeks apart I think). So fairly infrequently.

A bit of background, We use this server predominantly to proxy
requests. Every day for about 15 minutes, the two main home servers we
proxy to stop responding (they are doing backups or maintenance during
this time) so for those 15 minutes our clients (LNS/NAS) would be
sending a very large number of accounting interim packets and some
stop packets and would be resending these while the home servers are
down.

Some relevant proxy home server settings we currently use for the main
home servers we proxy to:

response_window = 14
zombie_period = 40
status_check = status-server
check_interval = 30
num_answers_to_alive = 3


The times that freeradius has died has been near the end of the 15
minutes of the home servers downtime.

The last time this happened, I noticed logs in the attached file. Ones
that sound relevant as follows:

Sun Mar 14 17:30:15 2010 : Proxy: Marking home server 10.0.1.48
port 1646 as zombie (it looks like it is dead).
Sun Mar 14 17:30:16 2010 : Proxy: Marking home server 10.0.1.47
port 1646 as zombie (it looks like it is dead).
Sun Mar 14 17:30:19 2010 : Proxy: Marking home server 10.0.1.47
port 1645 as zombie (it looks like it is dead).
Sun Mar 14 17:30:19 2010 : Error: No response to status check 903535
for home server 10.0.1.48 port 1646
Sun Mar 14 17:30:20 2010 : Error: No response to status check 903536
for home server 10.0.1.47 port 1646
...
Sun Mar 14 17:30:32 2010 : Error: Internal sanity check failed for
child state
...
Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
proxying requests.
Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
proxying requests.
Fri Mar 19 17:30:54 2010 : Proxy: Failed to create a new socket for
proxying requests.
...
Fri Mar 19 17:30:56 2010 : Error: ASSERT FAILED event.c[1084]:
home->ev != NULL


That last one is where it dies I think.

That last error seems a bit similar (but a bit different) to the
following thread:

http://www.mail-archive.com/freeradius-users@lists.freeradius.org/msg58052.html

">Re: ASSERT FAILED event.c in 2.1.7
>Alan DeKok
>Fri, 25 Sep 2009 01:19:40 -0700
>
>Maja Wolniewicz wrote:
>> After the upgrade from 2.1.6 to 2.1.7 my two servers died 3-4 times
>> daily with the following error:
>>
>> Thu Sep 24 19:07:13 2009 : Error: Received conflicting packet from
>> client AP-8 port 32777 - ID: 240 due to unfinished request 2396.  Giving
>> up on old request.
>> Thu Sep 24 19:07:13 2009 : Error: ASSERT FAILED event.c[2682]:
>> request->ev != NULL
>>
>> I have to return to 2.1.6, which works smoothly.
>
 > The simplest thing to do in the short term is to delete the assertion.
>
> Alan DeKok.
"

That one was found to be a bug and was fixed - I don't know if my case
is a bug though.


Another thread that sounds useful for this is:

http://www.mail-archive.com/freeradius-users@lists.freeradius.org/msg59985.html

"...
>Until then, configuring status-checks && local detail files will
>definitely help. I would recommend doing that *anyways* for network
>stability.
>
>Alan DeKok."


I don't currently use the robust proxy accounting that that thread
suggests. I expect that would probably work around the issue of
freeradius crashing in this case and I will give that a go. Just
posting this to let you know that it _might_ be a bug and to ask for
advice about whether you think this is a bug or not, and if I should
follow up on that, or if you think it is just my configuration that
needs some changes and what areas I should concentrate on if that is
the case?

Regards,
Anthony
Sun Mar 14 17:30:15 2010 : Proxy: Marking home server 10.0.1.48 
port 1646 as zombie (it looks like it is dead).
Sun Mar 14 17:30:16 2010 : Proxy: Marking home server 10.0.1.47 
port 1646 as zombie (it looks like it is dead).
Sun Mar 14 17:30:19 2010 : Proxy: Marking home server 10.0.1.47 
port 1645 as zombie (it looks like it is dead).
Sun Mar 14 17:30:19 2010 : Error: No response to status check 903535 
for home server 10.0.1.48 port 1646
Sun Mar 14 17:30:20 2010 : Error: No response to status check 903536 
for home server 10.0.1.47 port 1646
Sun Mar 14 17:30:23 2010 : Error: No response to status check 61094 
for home server 10.0.1.47 port 1645
Sun Mar 14 17:30:31 2010 : Error: rlm_radutmp: Logout entry for NAS 
lns02 port 2520 has wrong ID
Sun Mar 14 17:30:32 2010 : Error: Internal sanity check failed for 
child state
Sun Mar 14 17:30:32 2010 : Error: Reply from home server 10.0.1.48 
port 1646  - ID: 224 arrived too late for request 903469. Try 
increasing 'retry_delay' or 'max_request_time'
Sun Mar 14 17:30:33 2010 : Proxy: Marking home server 10.0.1.48 
port 1645 as zombie (it looks like it is dead).
Sun Mar 14 17:30:34 2010 : Error: Internal sanity check failed for 
child state
Sun Mar 14 17:30:34 2010 : Error: