For web searchers from the future, I found the patch attached to this defect:
https://issues.apache.org/jira/browse/RIVER-336
See Gregg's 02/Apr/10 comment on that issue.

Thanks, Peter and Gregg.
Chris

-----Original Message-----
From: Peter Firmstone [mailto:[email protected]]
Sent: Wednesday, April 13, 2011 1:18 AM
To: [email protected]
Subject: Re: SocketPermission and LookupLocatorDiscovery vs. Reggie scalability

From memory, Gregg provided the code with his CodebaseAccessClassLoader patch.

Christopher Dolan wrote:
> Wow, Gregg, it seems like every problem I bring up is one you've already
> solved! Do you have a patch available that I can test?
>
> Digressingly, I see that sun.rmi.server.LoaderHandler has the exact same
> locking issue because it seems to share a lot of code with
> PreferredClassProvider. Just as a passing point of curiosity, I wonder
> which one came first?
>
> Chris
>
> -----Original Message-----
> From: Gregg Wonderly [mailto:[email protected]]
> Sent: Tuesday, April 12, 2011 10:50 AM
> To: [email protected]
> Subject: Re: SocketPermission and LookupLocatorDiscovery vs. Reggie
> scalability
>
> I found the problem with the global lock some time ago and mentioned it
> on the Jini-Users list, I believe. After meeting no real interest in
> solving the problem, I made changes myself to use a finer-grained
> locking strategy, and that does tremendously reduce the contention at
> that point. This allows non-broken class loading to go on when a class
> loader is slow to respond or its DNS is slow to respond.
>
> Gregg Wonderly
>
> On Apr 11, 2011, at 9:59 AM, Christopher Dolan wrote:
>
>> I recently found the root cause of a long-standing performance problem
>> with Reggie that we've suffered for years. Our djinns may have 10,000
>> services registered, so when Reggie boots up cold it gets slammed with
>> thousands of TCP requests via LookupLocatorDiscovery,
>> JoinManager.register() and ServiceDiscoveryManager.lookup(). In theory,
>> this should be supportable because Reggie's read/write priority lock is
>> pretty efficient, but two big technical complications have harmed our
>> ability to scale:
>>
>> 1) PreferredClassProvider.lookupLoader() has a global lock. Behind
>> that lock, URLClassLoaders are built which may trigger SocketPermission
>> checks. That SocketPermission causes a reverse DNS lookup in
>> getCanonName() because of the default Sun JRE lib/security/java.policy
>> line:
>>
>> permission java.net.SocketPermission "localhost:1024-", "listen";
>>
>> Because PolicyFile.add() prepends, this check is evaluated first even if
>> you have local permissions that are more liberal. A handful of clients
>> with bad DNS configurations can cause long timeouts that stall the whole
>> process, causing eventual OutOfMemoryErrors because requests arrive
>> faster than they can be fulfilled.
>>
>> Possible code solutions (aside from fixing DNS configuration, of
>> course):
>>
>> a) switch PreferredClassProvider to a finer-grained lock (use the global
>> lock to look up the fine lock, and only hold the fine lock while
>> creating the class loaders)
>>
>> b) defer some of the class loader construction so the DNS lookups happen
>> after the PreferredClassProvider lock is released
>>
>> c) implement a replacement for SocketPermission and/or
>> PermissionCollection which is smarter about the order it checks
>> permissions to minimize the number of DNS lookups
>>
>> 2) When Reggie shuts down and then restarts, it accidentally
>> synchronizes all of the remote LookupLocatorDiscovery instances, which
>> may restart their polling WakeupManagers at the same time. What we see
>> is that several thousand TCP connections are all initiated within a few
>> seconds of each other, despite the
>> LookupLocatorDiscovery.LocatorReg.sleepTime values. When/if these
>> unicast connections succeed, then we see thousands more TCP connections
>> from JoinManager hitting Reggie in a giant wave. In VisualVM's
>> performance graphs, I see Reggie go from 100 threads to 3000 threads in
>> a couple of seconds, for example.
>>
>> Possible code solutions:
>>
>> a) add a random nudge to the polling interval in LookupLocatorDiscovery,
>> like the unicastDelayRange in the LocatorDiscovery class. This would
>> gradually desynchronize the clients
>>
>> b) likewise for JoinManager, perhaps
>>
>> These conditions are hard to reproduce in a typical lab, because they
>> require large numbers of machines and deliberately misconfigured DNS.
>> I'd appreciate any thoughts that others have about Reggie scaling
>> issues.
>>
>> Chris
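
For readers trying to picture solution 1a from the quoted message (use the global lock only to look up a fine lock, then hold just the fine lock while building the loader), here is a minimal sketch. It is not the code from the RIVER-336 patch or from PreferredClassProvider; the class name PerKeyLoaderCache, the Slot holder and the factory parameter are all hypothetical, and a plain Object stands in for the class loader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not taken from River or the RIVER-336 patch.
final class PerKeyLoaderCache<K, V> {

    // One slot per key. The slot object doubles as the fine-grained lock.
    private static final class Slot<V> {
        V value;
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();

    V get(K key, Function<K, V> factory) {
        Slot<V> slot;
        // Global lock: held only for the cheap map lookup/insert.
        synchronized (slots) {
            slot = slots.get(key);
            if (slot == null) {
                slot = new Slot<>();
                slots.put(key, slot);
            }
        }
        // Fine lock: the slow work (for example a DNS-bound permission check
        // during loader construction) blocks only callers asking for this key.
        synchronized (slot) {
            if (slot.value == null) {
                // A factory that returns null is simply retried on the next call.
                slot.value = factory.apply(key);
            }
            return slot.value;
        }
    }

    public static void main(String[] args) {
        PerKeyLoaderCache<String, Object> cache = new PerKeyLoaderCache<>();
        Object first = cache.get("http://example.invalid/a.jar", k -> new Object());
        Object second = cache.get("http://example.invalid/a.jar", k -> new Object());
        System.out.println(first == second); // the second call reuses the cached value
    }
}
```

With this shape, a codebase whose DNS is slow stalls only the threads waiting on that codebase's slot; lookups for other keys take the global lock briefly and move on.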
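
The "random nudge" in solution 2a can be sketched the same way. This is an assumption-laden illustration, not LookupLocatorDiscovery or LocatorDiscovery code: the class JitteredInterval, the method nextDelayMillis and the parameter jitterRangeMillis are hypothetical names, and the uniform distribution is just one reasonable choice.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; the names here do not come from the River code base.
final class JitteredInterval {

    private JitteredInterval() {
    }

    // Returns baseMillis plus a uniformly random extra delay in
    // [0, jitterRangeMillis], so clients that restarted together drift apart.
    static long nextDelayMillis(long baseMillis, long jitterRangeMillis) {
        if (baseMillis < 0 || jitterRangeMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        if (jitterRangeMillis == 0) {
            return baseMillis;
        }
        // nextLong(bound) is exclusive of bound, hence the +1.
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterRangeMillis + 1);
    }

    public static void main(String[] args) {
        // Example: a 15 s polling interval with up to 5 s of random extra delay.
        for (int i = 0; i < 3; i++) {
            System.out.println(nextDelayMillis(15_000L, 5_000L));
        }
    }
}
```

Drawing a fresh value for every polling cycle, rather than once at startup, is what lets thousands of clients spread out over successive cycles after a Reggie restart.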
