The logging from dnsmasq was insightful. It looks like the lookups favor the second server in the list. In my case, the second server was for quick offsite lookups so it was failing the local lookups.
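For anyone chasing a similar problem, here is a minimal sketch of one way to check each upstream server separately instead of relying on whichever one dnsmasq picks. It reads the nameserver entries from resolv.conf-style text and builds one dig command per server. The function names, sample addresses, and hostname are made up for illustration, and the timeout values are arbitrary. It only prints the commands; run them by hand or through subprocess on a host that has dig installed.

```python
def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in the text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (resolv.conf allows '#' and ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def dig_commands(name, servers):
    """Build one dig invocation per server so each can be tested on its own."""
    # +short trims the output; +time and +tries keep a dead server from
    # stalling the check. The values here are arbitrary.
    return [["dig", "@" + server, name, "+short", "+time=2", "+tries=1"]
            for server in servers]


if __name__ == "__main__":
    # Hypothetical upstream list; substitute the file dnsmasq actually reads.
    sample = """
    # local DNS
    nameserver 10.0.0.1
    # offsite DNS
    nameserver 192.0.2.53
    search lab.example.test
    """
    for cmd in dig_commands("ldap.example.test", parse_nameservers(sample)):
        print(" ".join(cmd))
```

If one server answers and another returns nothing or times out for the same internal name, that points at the server that should not be in the list for local lookups.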

From: Brigman, Larry
Sent: Thursday, February 22, 2018 3:57 PM
To: Clayton Coleman
Cc: users@lists.openshift.redhat.com
Subject: RE: DNS lookup failures

I hadn't tried that. I did turn on dnsmasq logging of queries to help pinpoint the problem. One of the issues was outside of Openshift, where the DNS server wouldn't forward requests and would just time out. Getting that out of the system was a multi-step process, plus restarting dnsmasq after changing the config files so that it would pick up the correct DNS servers.

When it occurs again, I'll use dig against the local resolver where I'm getting the failure.

On a multi-node cluster, changing all the files is painful.

From: Clayton Coleman [ccole...@redhat.com]
Sent: Thursday, February 22, 2018 2:58 PM
To: Brigman, Larry
Cc: users@lists.openshift.redhat.com
Subject: Re: DNS lookup failures

Do you see errors when you try to dig the master DNS address? Or if you dig the local dnsmasq? I wonder if we're caching a negative lookup or something similar.

On Wed, Feb 21, 2018 at 6:01 PM, Brigman, Larry <larry.brig...@arris.com> wrote:

I have been experiencing DNS lookup failures. This is preventing production deployment of Openshift.

I see it in two cases: lookup of a remote docker registry and lookup of an LDAP service. Neither of these is local to the server(s) in question, but both are local to the internal DNS servers. The LDAP case is easier for me to replicate, as I just need to attempt to log in.

Message:
Feb 20 11:21:16 lab-stack1 atomic-openshift-master-api: E0220 11:21:16.924930 2005 login.go:176] Error authenticating "XXXX" with provider "ldap": LDAP Result Code 200 "": dial tcp: lookup ldap.xxx.xxx on xxx.xxx.xxx.xxx:53: no such host

I obfuscated the user, provider name, and host for security.

xxx.xxx.xxx.xxx:53 is the master node, which is running dnsmasq with the default configuration provided by the openshift-ansible installation.

These get resolved for a while if I go on a host and do ‘host ldap.xxx.xxx’. It then works for a while and then reverts.

oc version
oc v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lab-stack1.lab.c-cor.com:8443
openshift v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62

What are the next steps to try? Using dig or host on the node in question always returns a valid lookup result.

_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users