SocketPermission and LookupLocatorDiscovery vs. Reggie scalability

Christopher Dolan Mon, 11 Apr 2011 08:00:48 -0700

I recently found the root cause of a long-standing performance problem
with Reggie that we've suffered for years. Our djinns may have 10,000
services registered, so when Reggie boots up cold it gets slammed with
thousands of TCP requests via LookupLocatorDiscovery,
JoinManager.register() and ServiceDiscoveryManager.lookup().  In theory,
this should be supportable because Reggie's read/write priority lock is
pretty efficient, but two big technical complications have harmed our
ability to scale:


 

  1) PreferredClassProvider.lookupLoader() has a global lock. Behind
that lock, URLClassLoaders are built which may trigger SocketPermission
checks. That SocketPermission causes a reverse DNS lookup in
getCanonName() because of the default Sun JRE lib/security/java.policy
line: 

   permission java.net.SocketPermission "localhost:1024-", "listen"; 

Because PolicyFile.add() prepends, this check is evaluated first even if
you have local permissions that are more liberal. A handful of clients
with bad DNS configurations can cause long timeouts that stall the whole
process, causing eventual OutOfMemoryErrors because requests arrive
faster than they can be fulfilled.

 

Possible code solutions (aside from fixing DNS configuration, of
course):

a) switch PreferredClassProvider to a finer-grained lock (use the global
lock to look up the fine lock, and only hold the fine lock while
creating the class loaders)

b) defer some of the class loader construction so the DNS lookups happen
after the PreferredClassProvider lock is released

c) implement a replacement for SocketPermission and/or
PermissionCollection which is smarter about the order in which it checks
permissions, to minimize the number of DNS lookups
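
To make option (a) concrete, here is a minimal sketch of the two-level
locking idea. It is not the PreferredClassProvider code: the class name
PerKeyLoaderCache, the Holder type and the slowFactory parameter are all
hypothetical, and it ignores the weak references and codebase
normalization a real class loader table would need. The concurrent map
stands in for the global lock and is touched only long enough to find
the per-key holder; the slow construction (where the DNS lookup would
happen) runs while holding only that key's lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-key cache; an illustration, not PreferredClassProvider. */
final class PerKeyLoaderCache<K, V> {
    // Stands in for the "global lock": used only to find or create a holder.
    private final ConcurrentMap<K, Holder<V>> holders = new ConcurrentHashMap<>();

    /** One holder per key; its monitor is the fine-grained lock. */
    private static final class Holder<V> {
        V value; // guarded by synchronized (this holder)
    }

    /**
     * Returns the cached value for key, building it with slowFactory at most
     * once. A slow build for one key blocks only callers asking for that key.
     */
    V get(K key, Function<? super K, ? extends V> slowFactory) {
        // Cheap step: no slow work is done while the map is being updated.
        Holder<V> holder = holders.computeIfAbsent(key, k -> new Holder<>());
        // Slow step: only this key's lock is held.
        synchronized (holder) {
            if (holder.value == null) {
                holder.value = slowFactory.apply(key);
            }
            return holder.value;
        }
    }
}
```

With this shape, a client whose codebase host has broken DNS would only
delay requests for that same codebase, not every caller of the provider.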

 

 2) When Reggie shuts down and then restarts, it accidentally
synchronizes all of the remote LookupLocatorDiscovery instances, which
may restart their polling WakeupManagers at the same time. What we see is that
several thousand TCP connections are all initiated within a few seconds
of each other, despite the LookupLocatorDiscovery.LocatorReg.sleepTime
values. When/if these unicast connections succeed, then we see thousands
more TCP connections from JoinManager hitting Reggie in a giant wave.
In VisualVM's performance graphs, I see Reggie go from 100 threads to
3000 threads in a couple of seconds, for example.

 

Possible code solutions:

a) add a random nudge to the polling interval in LookupLocatorDiscovery,
like the unicastDelayRange in the LocatorDiscovery class. This would
gradually desynchronize the clients

b) likewise for JoinManager, perhaps
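
A minimal sketch of the random nudge in option (a), assuming the retry
delay is computed in one place. The JitteredDelay class and its
parameter names are hypothetical and do not correspond to existing
River code; the idea is simply to add a uniformly random offset to the
configured sleep time on every retry so that clients drift apart.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; not part of LookupLocatorDiscovery or JoinManager. */
final class JitteredDelay {
    private JitteredDelay() {}

    /**
     * Returns baseMillis plus a uniformly random offset in
     * [0, jitterRangeMillis], so that clients that were woken together
     * pick different times for their next attempt.
     */
    static long next(long baseMillis, long jitterRangeMillis) {
        if (baseMillis < 0 || jitterRangeMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        if (jitterRangeMillis == 0) {
            return baseMillis; // jitter disabled
        }
        // nextLong(bound) is exclusive of bound, hence the + 1.
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterRangeMillis + 1);
    }
}
```

For example, JitteredDelay.next(15000, 5000) yields a delay between 15
and 20 seconds. Drawing a fresh offset on every retry, rather than once
per client, is what lets a synchronized herd spread out over time.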

 

 

These conditions are hard to reproduce in a typical lab, because they
require large numbers of machines and deliberately misconfigured DNS.
I'd appreciate any thoughts that others have about Reggie scaling
issues.

 

Chris
