If you'd like to see how this affects things, you can break the BIND configuration on a server running Reggie and watch how repeated reverse-DNS delays affect your service's performance. A simple approach is to empty out /etc/resolv.conf and put in addresses of servers that don't exist, so that the resolver has nothing to "talk" to. Then empty out /etc/hosts as well (except for localhost, and the host machine's own name and address) to make sure there's nothing for it to cheat from in there.

Then, write a test service and client that interact via lookup, followed by getting a lease, making a remote call from the client to the service, and then canceling the lease. Run this in a loop indefinitely, and log the times between calls. When you've got it correctly "broken", you will see many seconds between each call, indicating huge delays as the internal Java security implementation tries to perform reverse DNS on the client's inbound socket address.
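A full Jini service/client pair is a lot of scaffolding, but the delay itself can be observed with a much smaller sketch: timing the same reverse (PTR) lookup that the security layer performs on an inbound address. This is a minimal stand-in, not the loop described above; the address passed in is up to you.

```java
import java.net.InetAddress;

public class ReverseDnsTimer {
    /** Times the reverse lookup for one IP address and returns elapsed ms. */
    public static long timeReverseLookup(String ip) throws Exception {
        InetAddress addr = InetAddress.getByName(ip);
        long start = System.nanoTime();
        // getCanonicalHostName() triggers the same kind of reverse lookup
        // that the security implementation performs on socket addresses.
        String canonical = addr.getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ip + " -> " + canonical + " (" + elapsedMs + " ms)");
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        timeReverseLookup("127.0.0.1");
    }
}
```

With a working resolver this returns almost instantly; with the "broken" setup above, each call should stall for the resolver's full timeout.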

There may be additional things about the policy file contents that can extend the delay further, e.g. when there are multiple host-based grants to check.

What happens on failure is that Java caches the DNS lookup failure for only 10 seconds. So, 10 seconds later, the lookup has to be done again, starting with the same failing server.
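The 10-second negative cache is controlled by the security property `networkaddress.cache.negative.ttl`. A sketch of raising it programmatically (it can equally go in the JRE's `java.security` file); per the Java documentation, a value of `-1` means "cache failures forever", and the property must be set before the first lookup is made:

```java
import java.security.Security;

public class DnsCacheTuning {
    public static void main(String[] args) {
        // Set before any InetAddress lookup happens; -1 = cache failed
        // lookups forever, so each bad host costs only one timeout.
        Security.setProperty("networkaddress.cache.negative.ttl", "-1");
        System.out.println(
            Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```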

This is another one of those things which people can experience as a negative impact on their Jini "first experience". They'd look at the response times and say, crap, at this speed, I can just use paper and people on bikes to get things done faster!

Gregg Wonderly

On 3/15/2011 12:01 PM, Tom Hobbs wrote:
I've not experienced this issue myself.  It's an interesting one, and
Gregg's response is also intriguing.

I know it's not that helpful to you, but I'll see what I can do about
including something about this on the River site or wiki.

Chris, if you feel this is an issue that River can/should solve then
please create a Jira for it otherwise it'll get lost in the mists of
time.

On Tue, Mar 15, 2011 at 4:42 PM, Christopher Dolan
<[email protected]>  wrote:
Understood; increasing that value to something large would make me just
suffer that timeout once per remote machine per reboot. Is this the
solution most River users have employed, or have most of you simply
never had to deal with this problem? In my case, I may connect to
hundreds of remote machines via an app that wants a short startup time,
so this solution concerns me.

Chris

-----Original Message-----
From: Gregg Wonderly [mailto:[email protected]]
Sent: Sunday, March 13, 2011 9:08 AM
To: [email protected]
Subject: Re: reverse DNS timeouts and SocketPermission

Changing the DNS failure TTL is the most useful way to deal with this. 10
seconds is the default, and a failing DNS query will take longer than that,
so every use of the name will result in a new attempt to look up the same
thing on the same failing server.

Gregg

Sent from my iPhone

On Mar 10, 2011, at 3:06 PM, "Christopher Dolan"
<[email protected]>  wrote:

The java.net.SocketPermission class uses forward and reverse DNS
lookups
to ensure that we're allowed to talk to particular remote machines.
These lookups are used to canonicalize a remote host's name to ensure
that variations in that name don't lead to false negatives.
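That canonicalization can be seen directly with `SocketPermission.implies()`; a small sketch (whether two spellings actually match depends on the local resolver configuration):

```java
import java.net.SocketPermission;

public class CanonicalCheck {
    /** True if a grant for grantHost covers a connection to targetHost. */
    public static boolean granted(String grantHost, String targetHost) {
        SocketPermission grant = new SocketPermission(grantHost, "connect");
        SocketPermission asked = new SocketPermission(targetHost, "connect");
        // implies() canonicalizes both host names via forward/reverse DNS
        // before comparing them -- this is where the lookups happen.
        return grant.implies(asked);
    }

    public static void main(String[] args) {
        System.out.println(granted("localhost", "127.0.0.1"));
    }
}
```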

However, many people have found
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4975882) that if
there are configuration errors in a DNS system, the reverse DNS
failures
cause very significant latency (e.g. I've seen 10-12 seconds). This
latency has widely varying effects on a djinn. In many cases, it just
causes LookupCache slowdowns, which can be mitigated by the delayed
deserialization techniques discussed previously on the dev@ mailing
list. But in some cases, I've seen it cause Reggie to hang for a
while (I still don't understand where in Reggie the problem occurs;
maybe EventListeners?)

Obviously, the real solution is to properly configure DNS. But I would
like to know how other people have addressed this issue in their
deployments.

* Do you ensure the RMI codebase URLs all use canonical hostnames, or
IP addresses?
* Do you ensure that the TcpServerEndpoint has a consistent (perhaps
hard-coded) name?
* Do you have monitoring or logging code to proactively detect DNS
configuration errors?
* Do you fiddle the Java security property
"networkaddress.cache.negative.ttl"?
* Do you use host files?
* Do you use a non-Sun JVM?
* Do you use wildcards or IP addresses in your security policy file?
* Do you completely disable the socket check in your security policy
file? (yikes!)
* Have you simply never seen this problem?  (lucky you!)
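For reference, the last-resort option in that list — disabling the socket check — looks roughly like this in a policy file. This is a sketch of the "yikes!" workaround, not a recommendation; a bare "*" grant matches every host, so the permission check no longer depends on name resolution:

```
grant {
    // Wildcard host: matches any remote machine, so no per-host
    // canonicalization is needed to decide the check.
    permission java.net.SocketPermission "*", "connect,accept,resolve";
};
```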

Thanks,
Chris



