The sustained 1000 qps comes from an application making that many outbound connections. I agree that the application is very inefficient and shouldn't be doing a DNS lookup for every request it sends, but it's a Python program that uses urllib2.urlopen, so it creates a new connection each time. I suspect this isn't that unusual: it could be a server that hits an external service for every user request, for example. Given the activity on the GitHub issues I linked, it appears I'm not the only person to have run into this.
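To make it concrete, the hot path looks roughly like this (the URL and function name are placeholders, not our actual code):

    import urllib2

    URL = "http://api.example.com/check"  # placeholder external endpoint

    def handle_request():
        # urllib2 does no connection pooling: every urlopen() opens a new
        # TCP connection, and every new connection does its own
        # getaddrinfo() call, i.e. a fresh DNS query to the resolver.
        response = urllib2.urlopen(URL)
        body = response.read()
        response.close()
        return body

At 1000 requests/second that is at least 1000 DNS queries/second (more in practice, since with ClusterFirst the pod's search domains and ndots:5 setting expand each external lookup into several queries), and because DNS is UDP, each query leaves a conntrack entry behind until it times out.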
Thanks for the response though, since that answers my question: there are currently no plans to change how this works. Hopefully anyone else who hits this will find this email and solve it faster than I did. Finally, the fact that dnsPolicy: Default is *not* the default is also surprising; it should probably be called dnsPolicy: Host or something instead.
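For anyone who finds this later, the fix is a one-line change in the pod spec. A minimal sketch (the names and image are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: load-generator        # placeholder name
    spec:
      dnsPolicy: Default          # use the node's /etc/resolv.conf,
                                  # bypassing kube-dns entirely
      containers:
      - name: app
        image: example/app:1.0    # placeholder image

The trade-off is that such a pod can no longer resolve cluster-internal service names, which was fine for my application.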
On Oct 5, 2017 13:54, "'Tim Hockin' via Kubernetes user discussion and Q&A" <kubernetes-users@googlegroups.com> wrote:

> We had a proposal to avoid conntrack for DNS, but no real movement on it.
>
> We have flags to adjust the conntrack table size.
>
> Kernel has params to tweak timeouts, which users can tweak.
>
> Sustained 1000 QPS DNS seems artificial.
>
> On Thu, Oct 5, 2017 at 10:47 AM, Evan Jones <evan.jo...@bluecore.com> wrote:
> > TL;DR: Kubernetes dnsPolicy: ClusterFirst can become a bottleneck with a high rate of outbound connections. It seems like the problem is filling the nf_conntrack table, causing client applications to fail to do DNS lookups. I resolved this problem by switching my application to dnsPolicy: Default, which provided much better performance for my application that does not need cluster DNS.
> >
> > It seems like this is probably a "known" problem (see issues below), but I can't tell: is there a solution being worked on for this?
> >
> > Thanks!
> >
> > Details:
> >
> > We were running a load generator, and were surprised to find that the aggregate rate did not increase as we added more instances and nodes to our cluster (GKE 1.7.6-gke.1). Eventually the application started getting errors like "Name or service not known" at surprisingly low rates, like ~1000 requests/second. Switching the application to dnsPolicy: Default resolved the issue.
> >
> > I spent some time digging into this, and the problem is not the CPU utilization of kube-dns / dnsmasq itself. On my small cluster of ~10 n1-standard-1 instances, I can get about 80000 cached DNS queries/second. I *think* the issue is that when there are enough machines talking to this single DNS server, it fills the nf_conntrack table, causing packets to get dropped, which I believe ends up rate limiting the clients. dmesg on the node that is running kube-dns shows a constant stream of:
> >
> > [1124553.016331] nf_conntrack: table full, dropping packet
> > [1124553.021680] nf_conntrack: table full, dropping packet
> > [1124553.027024] nf_conntrack: table full, dropping packet
> > [1124553.032807] nf_conntrack: table full, dropping packet
> >
> > It seems to me that this is a bottleneck for Kubernetes clusters, since by default all queries are directed to a small number of machines, which will then fill the connection tracking tables.
> >
> > Is there a planned solution to this bottleneck? I was very surprised that *DNS* would be my bottleneck on a Kubernetes cluster, and at shockingly low rates.
> >
> > Related GitHub issues
> >
> > The following GitHub issues may be related to this problem. They all have a bunch of discussion but no clear resolution:
> >
> > Run kube-dns on each node: https://github.com/kubernetes/kubernetes/issues/45363
> > Run dnsmasq on each node; mentions conntrack: https://github.com/kubernetes/kubernetes/issues/32749
> > kube-dns should be a daemonset / run on each node: https://github.com/kubernetes/kubernetes/issues/26707
> >
> > dnsmasq intermittent connection refused: https://github.com/kubernetes/kubernetes/issues/45976
> > Intermittent DNS to external name: https://github.com/kubernetes/kubernetes/issues/47142
> >
> > kube-aws seems to already do something to run a local DNS resolver on each node? https://github.com/kubernetes-incubator/kube-aws/pull/792/
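P.S. For anyone who wants to tune rather than bypass: the knobs Tim mentions are, as far as I can tell, the nf_conntrack sysctls on each node (and I believe kube-proxy also has flags such as --conntrack-max-per-core to size the table at startup). Something like the following on an affected node; the values are illustrative, not recommendations:

    # How full is the table?
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # Enlarge the table:
    sudo sysctl -w net.netfilter.nf_conntrack_max=262144

    # Expire idle UDP entries (e.g. DNS) faster than the 30s default:
    sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout=10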