Claude Brown wrote:
> My original reply was confusingly brief. I've clarified below, and I've also
> put the module we wrote into github in case it helps:
>
> https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

OK. It's... odd.

> We avoided both "fastfile" and reloading "files" on the fly because of the
> number of updates we have to our user setup. The rate of change to our
> customers would require a reload every few seconds during most of the day.

I'd normally just put users into SQL.

> We had concerns in two areas:
> - The time to re-write the config and then re-load so frequently. This may
>   become a performance problem as our user base grows out to 250K
> - The risk of using the reload mechanism in a way that didn't seem consistent
>   with its design intent, or the likely usage pattern of reloads every day or
>   every few hours.

OK. Reloads don't work for you.

> FreeRADIUS core is very stable. But MySQL adds instability we have been
> unable to identify or reproduce in our environment.

That's odd. While MySQL isn't perfect, I have successfully used it in
systems with 100's of transactions/s. There was a VoIP provider ~8 years
ago using it with ~1K authentications/s.

> When large parts of our WiMAX network are restarted due to maintenance or
> failure the customer devices re-join the network. Whilst this doesn't happen
> often, when it does happen as many as 50K devices will simultaneously ask
> to rejoin the network. We need to service this sudden and dramatic backlog
> as quickly as possible.

Yup.

> With the "files" module this is a breeze with a single server. It just eats
> it up and everything comes back in a few minutes. Importantly, our testing
> shows the design goal of 250K users would also be met with one server.
>
> But with "rlm_sql" and MySQL we could not do it. The radiusd would start
> slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP,
> this is about 30 devices per sec). The radiusd log reported "Unresponsive
> child" in a MySQL module and gradually all the database concurrency would
> disappear as those threads were lost for further work.

MySQL does have concurrency issues.
But if you split it into auth/acct, most of those go away. i.e. use one
SQL module for authentication queries. Use a *different* one for
accounting inserts.

If you also use the decoupled-accounting method (see
raddb/sites-available), MySQL gets even faster. Having only one process
doing inserts can speed up MySQL by 3-4x.

> With our new far simpler approach, all of this has gone away because we are
> now using the "files" module and "users" file directly. The speed of
> authentication is essentially as per that module.

OK.

> The value of the extra attribute is in essence obtained like this:
> 1. Format a filename such as "/blah/%{Username}"
> 2. Read a line from this file

Using a database WILL be faster than reading the file system.

> We only have about 10 different values in these files: things like
> "voip-customer", "payment-overdue", "gold-customer",
> "exceeded-download-limit", etc. The value is used to select a DEFAULT entry
> in the "users" file that builds the reply attributes needed to configure the
> customer's service.

You can do the same kind of thing with SQL. Simply create a table, and do:

	update request {
		My-Magic-Attr = "%{sql: SELECT .. from ..}"
	}

Have the table contain the mapping of User-Name --> "voip-customer". You
should be able to get very high performance.

Then, use that attribute to do the mappings in the "users" file, just
like you do today.

> This happens when we have a major network event that causes lots of devices
> to simultaneously request authentication. Due to the unpredictable loss of
> threads, we have to manually manage the rate of the incoming authentications
> by slowly starting small sections of the network at a time.
>
> This process takes us hours of careful (manual) rate management.

That's just weird. SQL should be fine, *if* you design the system
carefully. That's the key.

> Possibly, but we couldn't find a way. We would be keen to understand the fix
> for this.

See above.

> We had no problem during normal operation.
> It was only when large numbers of devices (typically 10K or more)
> simultaneously needed to re-join the network for some reason.
>
> Do you know if these other sites have those kinds of events?

*Everyone* has this happen. There's really no need for a new module.

> However, the stability issue would never go away. To me it smells of a race
> condition somewhere in the MySQL library. As we could only ever reproduce it
> by cycling 10K or more users, it was proving very difficult to debug.

It's not a race condition, it's lock contention.

> But we spent far less time coding & testing a few 100 lines of "C" code than
> all the effort over the previous 18 months trying to reproduce, isolate or
> work around the MySQL problem. We gave up.
>
> A nice bonus is that we can now head towards a single server configuration
> with a file-system database. This will allow us to retire a raft of servers
> doing proxying, multiple radiusd, and multiple MySQL instances.

If it works for you... But it's really just a re-implementation of a
simple SQL table. It's a solution which is specific to your environment.

The more generic solution is:

- custom tables
- split auth/acct
- decouple acct from the "live" server

You should be able to get very high performance with that. The benefit
is you'll be using real databases, which is usually a good idea.

Alan DeKok.