Remembered seeing this on Twitter from last week (https://twitter.com/bboreham/status/973871688495652865):
"PSA: In #Kubernetes use absolute DNS names, not relative, where possible - put a dot at the end of the name. Cuts DNS lookups by 5x. I.e. instead of "example.com" put "example.com.""

On Tue, 20 Mar 2018 at 11:44 Evan Jones <evan.jo...@bluecore.com> wrote:

> The downside that I am aware of is that you don't get the Kubernetes DNS magic, where names automatically point to your services. For the particular use case where I ran into this, it worked perfectly!
>
> I was also going to attempt to add an alias so we could eventually migrate to dnsPolicy: Host instead of the confusingly named Default, but it seemed challenging enough that I never got around to it.
>
> Evan
>
> On Tue, Mar 20, 2018 at 1:55 AM, <m...@percy.io> wrote:
>
>> On Thursday, October 5, 2017 at 1:29:28 PM UTC-7, Evan Jones wrote:
>>
>>> The sustained 1000 qps comes from an application making that many outbound connections. I agree that the application is very inefficient and shouldn't be doing a DNS lookup for every request it sends, but it's a Python program that uses urllib2.urlopen, so it creates a new connection each time. I suspect this isn't that unusual? This could be a server that hits an external service for every user request, for example. Given the activity on the GitHub issues I linked, it appears I'm not the only person to have run into this.
>>>
>>> Thanks for the response, though, since it answers my question: there are currently no plans to change how this works. Hopefully anyone else who hits this will find this email and solve it faster than I did.
>>>
>>> Finally, the fact that dnsPolicy: Default is *not* the default is also surprising. It should probably be called dnsPolicy: Host or something instead.
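The trailing-dot advice in the tweet works because of resolv.conf search-list expansion: Kubernetes pods typically get ndots:5 and several search domains, so a name with fewer than five dots is tried against every search domain before being tried literally, while an absolute name (trailing dot) is tried exactly once. A rough, simplified sketch of that expansion logic (the search domains below are illustrative, not from any real cluster):

```python
# Rough sketch of how a glibc-style resolver expands a name against the
# resolv.conf search list. Not the actual resolver code -- just the
# candidate-generation logic, to show why "example.com." is one query
# while "example.com" fans out to several.

def candidate_queries(name, search_domains, ndots=5):
    if name.endswith("."):
        # Absolute name: queried as-is, no search-list expansion.
        return [name]
    candidates = []
    if name.count(".") >= ndots:
        # "Enough" dots: the literal name is tried first.
        candidates.append(name + ".")
    candidates += [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") < ndots:
        # Too few dots: the literal name is only tried last.
        candidates.append(name + ".")
    return candidates

# Illustrative search list resembling a pod's resolv.conf.
search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "c.my-project.internal",   # hypothetical provider search domain
]

print(candidate_queries("example.com.", search))  # 1 candidate
print(candidate_queries("example.com", search))   # 5 candidates
```

With four search domains, the relative form generates five lookups versus one for the absolute form, which matches the "5x" in the tweet (and each lookup may be doubled again for A + AAAA).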
>>> On Oct 5, 2017 13:54, "'Tim Hockin' via Kubernetes user discussion and Q&A" <kubernet...@googlegroups.com> wrote:
>>>
>>>> We had a proposal to avoid conntrack for DNS, but no real movement on it.
>>>>
>>>> We have flags to adjust the conntrack table size.
>>>>
>>>> The kernel has params for the timeouts, which users can tweak.
>>>>
>>>> Sustained 1000 QPS DNS seems artificial.
>>>>
>>>> On Thu, Oct 5, 2017 at 10:47 AM, Evan Jones <evan....@bluecore.com> wrote:
>>>>
>>>>> TL;DR: Kubernetes dnsPolicy: ClusterFirst can become a bottleneck with a high rate of outbound connections. It seems like the problem is filling the nf_conntrack table, causing client applications to fail DNS lookups. I resolved this problem by switching my application to dnsPolicy: Default, which provided much better performance for my application, which does not need cluster DNS.
>>>>>
>>>>> It seems like this is probably a "known" problem (see the issues below), but I can't tell: is there a solution being worked on for this?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Details:
>>>>>
>>>>> We were running a load generator and were surprised to find that the aggregate rate did not increase as we added more instances and nodes to our cluster (GKE 1.7.6-gke.1). Eventually the application started getting errors like "Name or service not known" at surprisingly low rates, around ~1000 requests/second. Switching the application to dnsPolicy: Default resolved the issue.
>>>>>
>>>>> I spent some time digging into this, and the problem is not the CPU utilization of kube-dns / dnsmasq itself. On my small cluster of ~10 n1-standard-1 instances, I can get about 80000 cached DNS queries/second.
>>>>> I *think* the issue is that when there are enough machines talking to this single DNS server, it fills the nf_conntrack table, causing packets to get dropped, which I believe ends up rate limiting the clients. dmesg on the node that is running kube-dns shows a constant stream of:
>>>>>
>>>>> [1124553.016331] nf_conntrack: table full, dropping packet
>>>>> [1124553.021680] nf_conntrack: table full, dropping packet
>>>>> [1124553.027024] nf_conntrack: table full, dropping packet
>>>>> [1124553.032807] nf_conntrack: table full, dropping packet
>>>>>
>>>>> It seems to me that this is a bottleneck for Kubernetes clusters, since by default all queries are directed to a small number of machines, which will then fill the connection tracking tables.
>>>>>
>>>>> Is there a planned solution to this bottleneck? I was very surprised that *DNS* would be my bottleneck on a Kubernetes cluster, and at shockingly low rates.
>>>>>
>>>>> Related GitHub issues
>>>>>
>>>>> The following GitHub issues may be related to this problem.
>>>>> They all have a bunch of discussion but no clear resolution:
>>>>>
>>>>> Run kube-dns on each node: https://github.com/kubernetes/kubernetes/issues/45363
>>>>> Run dnsmasq on each node; mentions conntrack: https://github.com/kubernetes/kubernetes/issues/32749
>>>>> kube-dns should be a DaemonSet / run on each node: https://github.com/kubernetes/kubernetes/issues/26707
>>>>> dnsmasq intermittent connection refused: https://github.com/kubernetes/kubernetes/issues/45976
>>>>> Intermittent DNS to external name: https://github.com/kubernetes/kubernetes/issues/47142
>>>>>
>>>>> kube-aws seems to already do something to run a local DNS resolver on each node? https://github.com/kubernetes-incubator/kube-aws/pull/792/
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "Kubernetes user discussion and Q&A" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-use...@googlegroups.com.
>>>>> To post to this group, send email to kubernet...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/kubernetes-users.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>
>> Evan,
>>
>> This post was very helpful. We've hit this exact same issue in our Kubernetes cluster, where we make a lot of outbound connections.
>>
>> Did you find any downsides with setting "dnsPolicy: Default", and did you end up sticking with that as the solution?
>>
>> Cheers,
>> Mike
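For anyone finding this thread later: the workaround discussed above is a single field in the pod spec. A minimal sketch (the pod and image names are placeholders); dnsPolicy: Default makes the pod inherit the node's name-resolution config instead of pointing at kube-dns, which sidesteps the conntrack bottleneck but means in-cluster service names will no longer resolve:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app              # placeholder name
spec:
  dnsPolicy: Default        # inherit the node's resolver config, bypassing kube-dns
  containers:
  - name: my-app
    image: my-app:latest    # placeholder image
```

Note the naming trap called out in the thread: ClusterFirst, not Default, is what pods get when dnsPolicy is unset.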