Re: Throughput slow with kernel 4.9.0

2018-09-28 Thread Brendon Colby
On Wed, Sep 26, 2018 at 10:22 PM Willy Tarreau  wrote:
> This could indicate random pauses in the hypervisor which could
> confirm the possibilities of traffic bursts I was talking about.

Ahh I see, that makes sense.

> It's just a matter of trade-off.

Definitely. For us, the degradation is unnoticeable.

> and consider yourself lucky to have saved $93/mo

Actually I misspoke there. The full build for us would cost about $600/mo.

That said, I'm really tempted to fire a server up and re-run the stats. It
would be interesting to see whether the Tc std dev and max come down to near 0.

-- 
Brendon Colby
Senior DevOps Engineer
Newgrounds.com



Re: Throughput slow with kernel 4.9.0

2018-09-26 Thread Willy Tarreau
Hi Brendon,

On Wed, Sep 26, 2018 at 02:45:29PM -0500, Brendon Colby wrote:
>   Tc mean: 0.59 ms
>   Tc std dev: 17.49 ms
>   Tc max: 1033.00 ms
>   Tc median: 0.00 ms

I don't know if all your servers are local, but if that's the case, the
Tc should always be very small, and a stddev of 17 ms and a max of 1 s are
huge. This could indicate random pauses in the hypervisor which could
confirm the possibilities of traffic bursts I was talking about.

> So it looks like Tc isn't the issue here. Everything else looks good
> to my eyes. I still think something else changed, because on Jessie
> this never happened like I said.

As I said, it's very possible that you were slightly above the minimum
required settings before, and that after the kernel's change of defaults
you're now slightly below them.

> > But I also confess I've not run a VM test myself for a
> > while because each time I feel like I'm going to throw up in the middle
> > of the test :-/
> 
> I know that's always been your position on VMs (ha) but one day I
> decided to try it for myself and haven't had a single issue until now.
> Our old hardware sat nearly 100% idle most of the time, so it was hard
> to justify the expense.

Oh don't get me wrong, I know how this happens and am not complaining
about it. I'm just saying that using VMs is mostly a cost saving
solution and that if you cut costs you often have to expect a sacrifice
on something else. For those who can stand a slight degradation of
performance or latency, or can occasionally spend some time chasing
issues which don't really exist, that's fine. Others can't afford this at all
and will prefer bare metal. It's just a matter of trade-off.

At least if you found a way to tune your system to work around this
issue, you should simply document it somewhere for you or your coworkers
and consider yourself lucky to have saved $93/mo without degrading the
performance :-)

Willy



Re: Throughput slow with kernel 4.9.0

2018-09-26 Thread Brendon Colby
On Tue, Sep 25, 2018 at 10:51 PM Willy Tarreau  wrote:

Hi Willy,

> Just be careful as you are allocating 64GB of RAM to the TCP stack.

Yeah, after I figured that out I set it down to 1M pages / 4GB of RAM, which
is enough to handle peak traffic on just one proxy VM.

> As a hint, take a look at the connection timers in your logs.

I calculated some stats from a sample of about 400K requests:

  Tw mean: 0.00 ms
  Tw std dev: 0.00 ms
  Tw max: 0.00 ms
  Tw median: 0.00 ms

  Tt mean: 296.70 ms
  Tt std dev: 4724.12 ms
  Tt max: 570127.00 ms
  Tt median: 1.00 ms

  Tr mean: 22.90 ms
  Tr std dev: 129.48 ms
  Tr max: 19007.00 ms
  Tr median: 1.00 ms

  Tc mean: 0.59 ms
  Tc std dev: 17.49 ms
  Tc max: 1033.00 ms
  Tc median: 0.00 ms

  Tq mean: 0.01 ms
  Tq std dev: 7.90 ms
  Tq max: 4980.00 ms
  Tq median: 0.00 ms
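
In case it's useful to anyone, this is roughly how I pulled those numbers
out of the logs (a quick sketch, no median; it assumes the default HTTP log
format where the timers appear as a single Tq/Tw/Tc/Tr/Tt field, so the $10
and the log path will need adjusting for your own setup):

awk '{
  split($10, t, "/")              # t[1]=Tq t[2]=Tw t[3]=Tc t[4]=Tr t[5]=Tt
  n++; sum += t[3]; sumsq += t[3] * t[3]
  if (t[3] > max) max = t[3]
} END {
  mean = sum / n
  printf "Tc mean: %.2f ms  std dev: %.2f ms  max: %d ms\n",
         mean, sqrt(sumsq / n - mean * mean), max
}' /var/log/haproxy.log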

So it looks like Tc isn't the issue here. Everything else looks good
to my eyes. I still think something else changed, because on Jessie
this never happened like I said.

> But I also confess I've not run a VM test myself for a
> while because each time I feel like I'm going to throw up in the middle
> of the test :-/

I know that's always been your position on VMs (ha) but one day I
decided to try it for myself and haven't had a single issue until now.
Our old hardware sat nearly 100% idle most of the time, so it was hard
to justify the expense.

Performance on hardware is much better, I'm sure, but none of my tests
show enough of a performance boost to justify even the $93/mo servers
I was looking at renting. Properly tuned VMs have worked really well
for us.

> We're mostly saying this because everywhere on the net we find copies of
> bad values for this field, resulting in out of memory issues for those
> who blindly copy-paste them.

Yep I totally understand that. I think I was just saying that since
everyone says "never change this" there was no discussion around what
it is exactly, what it does, what happens during memory pressure mode,
how to measure if you need to change it, etc.

> Regards,
> Willy

Thanks for chiming in on this, Willy.

-- 
Brendon Colby
Senior DevOps Engineer
Newgrounds.com



Re: Throughput slow with kernel 4.9.0

2018-09-25 Thread Willy Tarreau
Hi Brendon,

On Sun, Sep 23, 2018 at 03:48:36PM -0500, Brendon Colby wrote:
(...)
> The next thing I did was to try adjusting net.ipv4.tcp_mem. This is the one
> setting almost everyone says to leave alone, that the kernel defaults are
> good enough. Well, adjusting this one setting is what seemed to fix this
> issue for us.
> 
> Here are the default values the kernel set on Devuan / Stretch:
>
> net.ipv4.tcp_mem = 94401  125868  188802
>
> On Jessie:
>
> net.ipv4.tcp_mem = 92394  123194  184788
> 
> Here is what I set it to:
> 
> net.ipv4.tcp_mem = 16777216 16777216 16777216

Just be careful as you are allocating 64GB of RAM to the TCP stack.
However if this helps in your case, one possible explanation could be
that you're experiencing some significant latency to get out of the VM,
thus making the traffic more bursty, and are exceeding the buffers more
often.
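
As a quick sanity check (tcp_mem is counted in pages, so assuming the usual
4 kB page size):

$ echo $((16777216 * 4096 / 1024 / 1024 / 1024))   # pages -> GiB
64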

As a hint, take a look at the connection timers in your logs. I guess
you mostly connect to servers belonging to the local network. You
should almost always see "0" as the connect time, with occasional
jumps to "1" (millisecond) due to timer resolution. When VMs exhibit
large latencies, e.g. because sub-CPUs are allocated, it's very common
to see larger values there (5-10 ms). You can be sure that if it takes
5 ms for a packet to reach another host on the local network and for
the response to come back, then someone has to buffer it during all
this time where you don't have access to the CPU, and at high bandwidth
it means that your 2+ Gbps could in fact appear as 10-20 Gbps bursts
followed by large pauses.
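
To put a rough number on it using your 2.26 Gbps figure (back-of-the-envelope
only), a 5 ms stall means on the order of 1.4 MB that has to sit in a buffer
somewhere:

$ echo $((2260 * 5 / 8))   # Mbit/s * ms / 8 = kB buffered during a 5 ms stall
1412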

I have not observed any performance issue with 4.9 on hardware machines;
I'd even say that the performance is very good, saturating two 10G ports
with little CPU. But I also confess I've not run a VM test myself for a
while because each time I feel like I'm going to throw up in the middle
of the test :-/

So it might be possible that the 4.9 changes you're observing only or
mostly affect VMs. I remember changes close to this version enabling TCP
pacing, which helps a lot to avoid filling switch buffers when sending.
I can also see how that would probably not improve anything in VMs which
have to share their CPU. But it should not affect Rx.

> Since almost everyone says "do NOT adjust tcp_mem" there isn't much
> documentation out there that I can find on when you SHOULD adjust this
> setting.

We're mostly saying this because everywhere on the net we find copies of
bad values for this field, resulting in out of memory issues for those
who blindly copy-paste them. It can make sense to tune it once you're
certain of what you're doing (I think we still do it in our ALOHA
appliances; I'm not certain we still do, but I'm certain we used to,
though we started with kernel 2.4).

Regards,
Willy



Re: Throughput slow with kernel 4.9.0

2018-09-25 Thread Brendon Colby
Hi Aaron,

On Tue, Sep 25, 2018 at 1:04 PM Aaron West  wrote:
> It seems that the Kernel developers decided to halve the default TCP
> memory in the 4.x kernels

Your colleague emailed the list about this last November. It was the ONLY
thing I could find on this matter anywhere, and it was helpful in pointing
me in the right direction.

The crazy thing is that I doubled those numbers to what they were in
Jessie and we still had slow downloads. I think this was because
memory pressure mode was still being reached. Something else must have
changed because I never touched tcp_mem prior to this and have never
seen this sort of thing happen before.

> Simply decide if you need to increase it by looking out for
> the error message:

Normally you won't ever see an error message, at least not in my
experience. That's what was so frustrating about this.

Once the middle value (pressure) is reached, the kernel appears to
start throttling connections somehow (I think it starts to reduce the
max buffer size that can be allocated per connection). Nothing is ever
reported in the logs about this.

Only when you set the three tcp_mem values to the same number (or at
least the pressure and high values) will you see the error message.
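
The only way I found to see it coming is to compare the kernel's current TCP
page usage against the thresholds yourself; the numbers below are purely
illustrative:

# "mem" is the number of pages currently allocated to TCP sockets;
# pressure mode kicks in once it crosses the middle tcp_mem value
$ grep '^TCP:' /proc/net/sockstat
TCP: inuse 1843 orphan 12 tw 907 alloc 1921 mem 125900
$ cat /proc/sys/net/ipv4/tcp_mem
94401   125868  188802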

> Anyway, just thought I'd mention it for info and to say you are not alone ;)

Thanks, I appreciate it!

-- 
Brendon Colby
Senior DevOps Engineer
Newgrounds.com



Re: Throughput slow with kernel 4.9.0

2018-09-25 Thread Aaron West
Hi Brendon,

I just wanted to reach out and say that we found this too!

It seems that the kernel developers decided to halve the default TCP
memory in the 4.x kernels. That probably makes sense for most
applications, but not for the busy, high network usage we typically see
when acting as a load balancer and/or reverse proxy.

The actual change is mentioned here:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b66e91ccbc34ebd5a2f90f9e1bc1597e2924a500

For me the 50% reduction didn't work well... So I wrote a script to
simply double TCP memory if a newer kernel is detected, since I knew it
had been reduced by 50% from what I had been used to and the old defaults
had always worked for me. However, your method is better (less lazy)...
Simply decide if you need to increase it by looking out for the error
message:

TCP: out of memory -- consider tuning tcp_mem
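
(For what it's worth, the script boils down to something like this -- a rough
sketch of the idea rather than the exact thing we run:)

#!/bin/sh
# Sketch only: on a 4.x kernel, double whatever tcp_mem the kernel
# auto-sized at boot (the values are in pages, not bytes).
case "$(uname -r)" in
  4.*)
    read -r low pressure high < /proc/sys/net/ipv4/tcp_mem
    sysctl -w net.ipv4.tcp_mem="$((low * 2)) $((pressure * 2)) $((high * 2))"
    ;;
esac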

Anyway, just thought I'd mention it for info and to say you are not alone ;)

Aaron West

Loadbalancer.org Ltd.
www.loadbalancer.org



Throughput slow with kernel 4.9.0

2018-09-23 Thread Brendon Colby
Greetings,

Similar to this user:

https://www.mail-archive.com/haproxy@formilux.org/msg27698.html

I recently upgraded our proxy VMs from Debian 8/Jessie (kernel 3.16.0) to
Devuan 2/ASCII (Debian Stretch w/o systemd, kernel 4.9.0). I know running
haproxy on a VM is often discouraged, but we have done so for years with
great success.

Right now I'm stress testing the new build on ONE proxy VM doing 861 req/s,
2.26 Gbps outbound traffic, 70k pps in, 90k pps out, with quite a bit of
capacity to spare. That's achievable with some tweaking, but nothing much
beyond what would have to be done on hardware anyway.

Our VM hosts have one Xeon E5-2687W v4 processor (12 core, 24 logical),
256GB ram, and dual Intel 10G adapters, one for external traffic, one for
internal traffic. I have the proxy VMs configured with 8 cores, 8GB ram,
and two virtio adapters both with multi-queue set to 2 (which gives me two
receive queues per adapter). We're running Proxmox 5.

haproxy is a custom build of 1.8.14 built with:

make TARGET=linux2628 USE_PCRE=1 USE_GETADDRINFO=1 USE_OPENSSL=1 \
     USE_ZLIB=1 USE_FUTEX=1

I have each receive queue pinned to a different processor (0 - 3). haproxy
is configured with nbproc 4 and pinned to procs 4 - 7. iptables with
connection tracking is enabled (I couldn't see ANY performance benefits
from using a stateless firewall).
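
Concretely, the pinning looks something like this (the IRQ numbers below are
placeholders -- use whatever /proc/interrupts shows for the virtio queues on
your guest):

grep virtio /proc/interrupts              # find the per-queue IRQ numbers
echo 0 > /proc/irq/24/smp_affinity_list   # NIC1 rx queue 0 -> CPU 0
echo 1 > /proc/irq/25/smp_affinity_list   # NIC1 rx queue 1 -> CPU 1
echo 2 > /proc/irq/26/smp_affinity_list   # NIC2 rx queue 0 -> CPU 2
echo 3 > /proc/irq/27/smp_affinity_list   # NIC2 rx queue 1 -> CPU 3

On the haproxy side it's just nbproc plus cpu-map in the global section,
along the lines of:

nbproc 4
cpu-map 1 4
cpu-map 2 5
cpu-map 3 6
cpu-map 4 7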

I can get near wire speeds between VM hosts as well as between VM guests on
the local network.

The problem we saw right away was that when any amount of traffic was
flowing through these new proxy builds, single-stream throughput would be
severely reduced. Without load, I could pull down a file at 200+ Mbps with
a single stream. With load, that would drop to 10-15 Mbps, if that.

This meant that 1080p videos would endlessly buffer and large images would
load like they did in the 90s on dial-up. Not good.

After a bunch of trial and error, I narrowed the issue down to the network
layer itself. The only thing I could find that may have pointed to what was
going on was this:

# netstat -s | grep buffer
16889843 packets pruned from receive queue because of socket buffer overrun
7626 packets dropped from out-of-order queue because of socket buffer overrun
3912652 packets collapsed in receive queue due to low socket buffer

These values were incrementing a lot faster than on the old build. My
research on this pointed to r/wmem settings, which I've never adjusted
before because most recommendations seem to be to leave these alone. Plus I
could never determine that we actually needed to adjust these.
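
A simple way to tell whether they're still climbing under load is to watch
the deltas, e.g.:

watch -d -n 10 "netstat -s | grep -E 'pruned|collapsed'"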

Here are the sysctl settings we've been using for years:

vm.swappiness=10
net.ipv4.tcp_tw_reuse=1
net.ipv4.ip_local_port_range=1024 65535
net.core.somaxconn=10240
net.core.netdev_max_backlog=10240
net.ipv4.conf.all.rp_filter=1
net.ipv4.tcp_max_syn_backlog=10240
net.ipv4.tcp_synack_retries=3
net.ipv4.tcp_syncookies=1
net.netfilter.nf_conntrack_max=4194304

After doing a TON of research, I decided to adjust the r/wmem settings.
From here:

http://fasterdata.es.net/host-tuning/linux/

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations

I settled on the following:

# allow testing with buffers up to 128MB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# increase Linux autotuning TCP buffer limit to 64MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

Those values are described as good "For a host with a 10G NIC optimized for
network paths up to 200ms RTT, and for friendliness to single and parallel
stream tools...", which seemed fine for us.
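
For what it's worth, one way to see what the autotuning actually picks per
connection is to look at the socket buffers during a large transfer, e.g.:

# rb/tb in the skmem output are the receive/transmit buffer limits (bytes)
# the kernel has autotuned for each established socket
ss -tmi state established '( sport = :80 or sport = :443 )'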

However, these settings didn't make any difference.

The next thing I did was to try adjusting net.ipv4.tcp_mem. This is the one
setting almost everyone says to leave alone, that the kernel defaults are
good enough. Well, adjusting this one setting is what seemed to fix this
issue for us.

Here are the default values the kernel set on Devuan / Stretch:

net.ipv4.tcp_mem = 94401  125868  188802

On Jessie:

net.ipv4.tcp_mem = 92394  123194  184788

Here is what I set it to:

net.ipv4.tcp_mem = 16777216 16777216 16777216

I can re-create the low throughput issue by changing tcp_mem back to the
defaults. I'm not even sure the other settings are necessary (still testing
that).

Can anyone shed some light on why adjusting tcp_mem fixed this? Are the
other settings needed / appropriate? I'm not fond of deploying anything
into production with settings I've copied from the internet without fully
understanding what I'm doing. Most posts on this only copy the kernel docs
verbatim.

Since almost everyone says "do NOT adjust tcp_mem" there isn't much
documentation out there that I can find on when you SHOULD adjust this
setting.

All I know is that by changing tcp_mem I can run an iperf test and get over
1 Gbps even with site traffic being over 2 Gbps (we have 3 Gbps available).
File downloads are