I recently found the root cause of a long-standing performance problem with Reggie that we've suffered from for years. Our djinns may have 10,000 services registered, so when Reggie boots up cold it gets slammed with thousands of TCP requests via LookupLocatorDiscovery, JoinManager.register() and ServiceDiscoveryManager.lookup(). In theory, this should be supportable because Reggie's read/write priority lock is pretty efficient, but two big technical complications have harmed our ability to scale:

1) PreferredClassProvider.lookupLoader() has a global lock. Behind that lock, URLClassLoaders are built, which may trigger SocketPermission checks. That SocketPermission check causes a reverse DNS lookup in getCanonName() because of this line in the default Sun JRE lib/security/java.policy:

    permission java.net.SocketPermission "localhost:1024-", "listen";

Because PolicyFile.add() prepends, this check is evaluated first even if you have local permissions that are more liberal. A handful of clients with bad DNS configurations can cause long timeouts that stall the whole process, eventually causing OutOfMemoryErrors because requests arrive faster than they can be fulfilled.

Possible code solutions (aside from fixing the DNS configuration, of course):

a) switch PreferredClassProvider to a finer-grained lock (use the global lock to look up the fine lock, and only hold the fine lock while creating the class loaders)

b) defer some of the class loader construction so the DNS lookups happen after the PreferredClassProvider lock is released

c) implement a replacement for SocketPermission and/or PermissionCollection which is smarter about the order in which it checks permissions, to minimize the number of DNS lookups

2) When Reggie shuts down and then restarts, it accidentally synchronizes all of the remote LookupLocatorDiscovery instances, which may restart their polling WakeupManagers at the same time. What we see is that several thousand TCP connections are all initiated within a few seconds of each other, despite the LookupLocatorDiscovery.LocatorReg.sleepTime values. When/if these unicast connections succeed, we then see thousands more TCP connections from JoinManager hitting Reggie in a giant wave. In VisualVM's performance graphs, for example, I see Reggie go from 100 threads to 3000 threads in a couple of seconds.

Possible code solutions:

a) add a random nudge to the polling interval in LookupLocatorDiscovery, like the unicastDelayRange in the LocatorDiscovery class.
This would gradually desynchronize the clients.

b) likewise for JoinManager, perhaps

These conditions are hard to reproduce in a typical lab, because they require large numbers of machines and deliberately misconfigured DNS.

I'd appreciate any thoughts that others have about Reggie scaling issues.

Chris
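
P.S. To make the "random nudge" idea in 2(a) concrete, here is a rough sketch of what I have in mind. It is only an illustration: the class name JitteredInterval and the method nextDelayMillis() are made up and are not existing Jini classes, and the 15 second base and 5 second range are arbitrary example values. A real change would have to be wired into wherever LookupLocatorDiscovery computes its next wakeup time.

```java
import java.util.Random;

/**
 * Hypothetical helper (not part of any existing API): adds a random
 * offset to a fixed polling interval so that clients which were started
 * at the same moment gradually drift apart.
 */
public class JitteredInterval {
    private final long baseMillis;   // nominal polling interval
    private final long rangeMillis;  // width of the random nudge
    private final Random random;

    public JitteredInterval(long baseMillis, long rangeMillis, Random random) {
        if (baseMillis < 0 || rangeMillis < 0) {
            throw new IllegalArgumentException("intervals must be non-negative");
        }
        this.baseMillis = baseMillis;
        this.rangeMillis = rangeMillis;
        this.random = random;
    }

    /** Returns baseMillis plus a uniform random offset in [0, rangeMillis). */
    public long nextDelayMillis() {
        if (rangeMillis == 0) {
            return baseMillis; // jitter disabled
        }
        // nextDouble() is in [0.0, 1.0), so the offset stays below rangeMillis.
        long offset = (long) (random.nextDouble() * rangeMillis);
        return baseMillis + offset;
    }

    public static void main(String[] args) {
        // Example values only: 15 s base interval with up to 5 s of jitter.
        JitteredInterval interval = new JitteredInterval(15_000L, 5_000L, new Random());
        for (int i = 0; i < 5; i++) {
            System.out.println("next poll in " + interval.nextDelayMillis() + " ms");
        }
    }
}
```

Because each client draws a fresh offset on every cycle, a burst that starts synchronized should spread out over successive polls instead of arriving as one wave.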
