Claude Brown wrote:
> My original reply was confusingly brief. I've clarified below, and I've also 
> put the module we wrote into github in case it helps:
> 
> https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

  OK.  It's... odd.

> We avoided both "fastfile" and reloading "files" on the fly because of the 
> number of updates we have to our user setup.  The rate of change to our 
> customers would require a reload every few seconds during most of the day.

  I'd normally just put users into SQL.

> We had concerns in two areas:
> - The time to re-write the config and then re-load so frequently. This may 
> become a performance problem as our user base grows out to 250K
> - The risk of using the reload mechanism in a way that didn't seem consistent 
> with its design intent, or the likely usage pattern of reloads every day or 
> every few hours.

  OK.  Reloads don't work for you.

> FreeRADIUS core is very stable. But MySQL adds instability we have been 
> unable to identify or reproduce in our environment.

  That's odd.  While MySQL isn't perfect, I have successfully used it in
systems with hundreds of transactions/s.  There was a VoIP provider ~8
years ago using it with ~1K authentications/s.

> When large parts of our WiMAX network are restarted due to maintenance or 
> failure the customer devices re-join the network. Whilst this doesn't happen 
> often, when it does happen we need to get as many as 50K devices will 
> simultaneously ask to rejoin the network.  We need to service this sudden and 
> dramatic backlog as quickly as possible.

  Yup.

> With the "files" module this is a breeze with a single server.  It just eats 
> it up and everything comes back in a few minutes. Importantly, our testing 
> shows the design goal of 250K users would also be met with one server.
> 
> But with "rlm_sql" and MySQL we could not do it. The radiusd would start 
> slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP, 
> this is about 30 devices per sec).  The radiusd log reported "Unresponsive 
> child" in a MySQL module and gradually all the database concurrency would 
> disappear as those threads were lost for further work.

  MySQL does have concurrency issues.  But if you split it into
auth/acct, most of those go away.  i.e. use one SQL module for
authentication queries.  Use a *different* one for accounting inserts.

  If you also use the decoupled-accounting method (see
raddb/sites-available), MySQL gets even faster.  Having only one process
doing inserts can speed up MySQL by 3-4x.
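
  To illustrate the shape of "only one process doing inserts", here is
a minimal Python sketch.  It is not how FreeRADIUS implements decoupled
accounting (see raddb/sites-available for that); threads stand in for
request handlers, SQLite stands in for MySQL, and the "acct" table, its
columns, and the function names are all hypothetical.  Handlers only
enqueue records, and a single writer drains the queue in batches.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the writer to finish


def writer(q, db_path, batch_size=100):
    """Single inserter: the only code that touches the database."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS acct "
        "(username TEXT, session_id TEXT, status TEXT)"
    )
    done = False
    while not done:
        batch = [q.get()]  # block until at least one item arrives
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if any(item is STOP for item in batch):
            done = True
            batch = [item for item in batch if item is not STOP]
        if batch:
            # One transaction per batch instead of one per record.
            db.executemany("INSERT INTO acct VALUES (?, ?, ?)", batch)
            db.commit()
    db.close()


def run_demo(workers=4, per_worker=50):
    """Run several enqueue-only handlers against one writer; return row count."""
    db_path = os.path.join(tempfile.mkdtemp(), "acct.db")
    q = queue.Queue()
    w = threading.Thread(target=writer, args=(q, db_path))
    w.start()

    def handle_requests(worker_id):
        # Handlers never wait on the database; they only enqueue.
        for i in range(per_worker):
            q.put((f"user{worker_id}", f"sess-{worker_id}-{i}", "Start"))

    threads = [
        threading.Thread(target=handle_requests, args=(n,))
        for n in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    q.put(STOP)  # all handlers are done, so nothing follows the sentinel
    w.join()

    db = sqlite3.connect(db_path)
    count = db.execute("SELECT COUNT(*) FROM acct").fetchone()[0]
    db.close()
    return count
```

  Calling run_demo() returns the number of rows written.  The point of
the design is that the handlers never contend for database locks,
because only the writer holds a connection.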

> With our new far simpler approach, all of this has gone away because we are 
> now using the "files" module and "users" file directly. The speed of 
> authentication is essentially as per that module.

  OK.

> The value of the extra attribute is in essence obtained like this:
> 1. Format a filename such as "/blah/%{Username}"
> 2. Read a line from this file

  Using a database WILL be faster than reading the file system.

> We only have about 10 different values in these files: things like 
> "voip-customer", "payment-overdue", "gold-customer", 
> "exceeded-download-limit", etc.  The value is used to select a DEFAULT entry 
> in the "users" file that builds the reply attributes needed to configure the 
> customers service.

  You can do the same kind of thing with SQL.  Simply create a table,
and do:

   update request {
      My-Magic-Attr = "%{sql: SELECT .. from ..}"
   }

  Have the table contain the mapping of User-Name --> "voip-customer".
You should be able to get very high performance.  Then, use that
attribute to do the mappings in the "users" file, just like you do today.
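
  As a rough sketch of what such a table looks like, here is a small
Python example.  SQLite stands in for MySQL so it runs anywhere, and
the table name "user_class", its columns, the sample users, and the
function names are hypothetical.  It shows the two operations that
matter: a primary-key lookup per authentication, and a one-row update
when a customer's status changes.

```python
import sqlite3


def build_db():
    """Create the mapping table with a couple of sample rows."""
    db = sqlite3.connect(":memory:")
    # The primary key on username means each lookup is one indexed probe.
    db.execute(
        "CREATE TABLE user_class ("
        " username TEXT PRIMARY KEY,"
        " service_class TEXT NOT NULL)"
    )
    db.executemany(
        "INSERT INTO user_class (username, service_class) VALUES (?, ?)",
        [("alice", "voip-customer"), ("bob", "payment-overdue")],
    )
    db.commit()
    return db


def lookup_class(db, username, default="unknown"):
    """Return the class for a user, or a default if there is no row."""
    row = db.execute(
        "SELECT service_class FROM user_class WHERE username = ?",
        (username,),
    ).fetchone()
    return row[0] if row else default


def set_class(db, username, service_class):
    """Change one user's class: a single-row write, with no reload."""
    db.execute(
        "INSERT OR REPLACE INTO user_class (username, service_class) "
        "VALUES (?, ?)",
        (username, service_class),
    )
    db.commit()
```

  For example, after db = build_db(), calling lookup_class(db, "alice")
gives "voip-customer", and set_class(db, "bob", "gold-customer")
changes what the next lookup for bob returns.  Frequent customer
changes become single-row writes rather than rewrites of a config
file.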

> This happens when we have a major network event that causes lots of devices 
> to simultaneously request authentication. Due to the unpredictable loss of 
> threads, we have to manually manage the rate of the incoming authentications 
> by slowly starting small sections of the network at a time.
> 
> This process takes us hours of careful (manual) rate management.

  That's just weird.  SQL should be fine, *if* you design the system
carefully.  That's the key.

> Possibly, but we couldn't find a way. We would be keen to understand the fix 
> for this.

  See above.

> We had no problem during normal operation.  It was only when large numbers of 
> devices (typically 10K or more) simultaneously needed to re-join the network 
> for some reason. 
> 
> Do you know if these other sites have those kinds of events?

  *Everyone* has this happen.  There's really no need for a new module.

> However, the stability issue would never go away. To me it smells of a race 
> condition somewhere in the MySQL library. As we could only ever reproduce it 
> by cycling 10K or more users, it was proving very difficult to debug.

  It's not a race condition, it's lock contention.

> But we spent far less time coding & testing a few 100 lines of "C" code than 
> all the effort over the previous 18 months trying to reproduce, isolate or 
> work around the MySQL problem.  We gave up.
> 
> A nice bonus is that we can now head towards a single server configuration 
> with a file-system database. This will allow us to retire a raft of servers 
> doing proxying, multiple radiusd, and multiple MySQL instances.

  If it works for you...

  But it's really just a re-implementation of a simple SQL table.  It's
a solution which is specific to your environment.

  The more generic solution is:

- custom tables
- split auth/acct
- decouple acct from the "live" server

  You should be able to get very high performance with that.  The
benefit is you'll be using real databases, which is usually a good idea.

  Alan DeKok.
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html