rob05c commented on issue #7324:
URL: https://github.com/apache/trafficserver/issues/7324#issuecomment-804427594


   `loadavg` is a very broad metric: on Linux it counts runnable tasks (CPU demand) as well as tasks blocked in uninterruptible sleep (typically disk or other I/O wait), so a high value alone doesn't tell you what's saturated.
   Can you look at specific metrics on your system, and see what specifically has high load? Is it just CPU? Memory? Disk io_wait? All of the above?
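   To break that down on Linux without installing anything, the `/proc` filesystem already has what you need. A quick sketch:

```shell
#!/bin/sh
# Break a high loadavg down into its likely causes using only /proc (Linux).

# 1/5/15-minute load averages plus runnable/total task counts.
cat /proc/loadavg

# Cumulative CPU jiffies since boot; a large iowait share relative to
# user+system points at the disks rather than the CPU.
awk '/^cpu /{printf "user=%s system=%s idle=%s iowait=%s\n", $2, $4, $5, $6}' /proc/stat

# Memory pressure: low MemAvailable with shrinking SwapFree means trouble.
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```

   Note the CPU numbers are counters since boot; sample them twice a few seconds apart (or run `vmstat 1` / `iostat -x 1` from the sysstat package) to see current rates rather than totals.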
   
   ATS will use as much memory as you tell it to. You can allocate ramdisks and 
give them to ATS as block devices. Each disk given to ATS also has a memory 
cache in front of it, the size of which is configurable.
   
   See:
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#ram-cache
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/storage.config.en.html
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/volume.config.en.html
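   As a rough sketch of how those pieces fit together (the device paths and sizes below are illustrative assumptions, not recommendations for your hardware):

```
# storage.config -- one cache device per line; raw block devices are preferred.
# (Assumption: the paths below are stand-ins for your real devices.)
/dev/disk/by-id/nvme-your-ssd-here
/dev/ram0

# records.config -- the RAM cache that sits in front of the disk cache,
# sized explicitly rather than left at the default.
CONFIG proxy.config.cache.ram_cache.size INT 8G
CONFIG proxy.config.cache.ram_cache_cutoff INT 4194304
```

   By default the RAM cache is sized automatically from the disk cache size; setting it explicitly is how you trade RAM for hit latency.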
   
   ATS does have some known memory leaks, but they're generally pretty small. 
It shouldn't use much more memory than what you allocated for storage and 
ram_cache, and the memory shouldn't grow much over time.
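   If you want to confirm that on your box, one simple approach is to sample the resident set size over time. A sketch (`traffic_server` is ATS's main process name):

```shell
#!/bin/sh
# Sample the resident set size (RSS) of a process a few times; a steadily
# climbing rss_kb over hours or days suggests a leak, flat is healthy.
# Usage: rss_watch <process-name> <samples> <interval-seconds>
rss_watch() {
    name="$1"; samples="${2:-5}"; interval="${3:-60}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        # pid= / rss= suppress headers; rss is reported in kilobytes.
        ps -C "$name" -o pid=,rss= | while read -r pid rss; do
            printf '%s pid=%s rss_kb=%s\n' "$(date +%FT%T)" "$pid" "$rss"
        done
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    done
}

# In practice you'd use a long interval (e.g. 300s) and let it run for a while.
rss_watch traffic_server 2 1
```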
   
   Many people run ATS in production at bandwidths much higher than 3Gbps; my company has caches doing in excess of 20Gbps. If you're having trouble reaching those speeds, another possibility is Linux kernel tuning: it's common to need a fair amount of sysctl tuning to achieve high performance, though I wouldn't expect much of it to be necessary under 10Gbps.
   
   I assume this is somewhat recent hardware, with decent CPUs? We do have some prod servers that struggle to exceed 10Gbps because of underpowered CPUs with few PCIe lanes. Platforms with too few PCIe lanes can cause exactly this kind of network bottleneck.
   
   If your high load average is being caused by disk io_wait, are you certain 
your SSDs are fast? Some SSD brands have poor performance. It may be worth 
testing their sequential and random speeds, just to be sure that isn't the 
problem.
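   A crude sequential-write sanity check needs nothing but `dd` (point the output file at a mount on the SSD under test, not `/tmp`, and treat the number as a rough lower bound):

```shell
#!/bin/sh
# Sequential write: conv=fdatasync flushes to the device before dd reports,
# so the MB/s figure reflects the disk rather than the page cache.
# Replace /tmp with a mount point on the SSD you actually want to test.
dd if=/dev/zero of=/tmp/ats-seq-test bs=1M count=256 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/ats-seq-test

# Random 4k I/O is what a cache mostly does; fio (if installed) measures it
# properly, e.g.:
#   fio --name=randrw --filename=/path/on/ssd/fio-test --rw=randrw --bs=4k \
#       --size=1G --direct=1 --iodepth=32 --runtime=60 --time_based
```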
   
   > traffic server's error log is increasing as following
   > 20201115.16h42m02s CONNECT:[0] could not connect [CONNECTION_CLOSED] to 127.0.0.1 for 'http://localhost/vod/encrypt/prod/8a01918b72167aad01722a8007db243e/8a01918b72167aad01722a8007db243e_1500_2/media-88320000.mp4'
   
   I'm not sure I understand. Your initial question was about bandwidth bottlenecks and high loadavg, but this is a connection error: it looks like an origin (on localhost?) is misconfigured, or unable to handle the request load.
   
   Are you saying you see a lot of these errors as you approach 3Gbps? That sounds like the Origin isn't able to handle the load, i.e. the problem is with the Origin, not ATS. Can you verify that your Origin itself is capable of handling the request load?
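   One low-tech way to check, assuming you can reach the origin directly from the cache host (the URL below is a placeholder, substitute a real object on your origin):

```shell
#!/bin/sh
# Hit the origin directly, bypassing ATS, and tally the status codes.
# ORIGIN_URL is a placeholder -- substitute a real object on your origin.
ORIGIN_URL="http://127.0.0.1/vod/encrypt/prod/example-object.mp4"

for i in $(seq 1 20); do
    # 000 means curl couldn't connect at all -- the same failure ATS is logging.
    curl -o /dev/null -s --max-time 5 -w '%{http_code}\n' "$ORIGIN_URL"
done | sort | uniq -c
```

   For a proper concurrent load test, tools like `ab` or `wrk` will push the origin much harder than a serial curl loop.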
   
   Are these requests mostly Cache Hits or Misses? For a CDN, ATS should presumably be caching. If all of the traffic is going to the Origin, that could be the problem: the Origin can't handle the full 3Gbps because everything is a Cache Miss, and you may need to set Cache-Control headers so ATS caches the content and the Origin only sees the misses.
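   You can answer that from ATS's own counters. A simplified sketch (ATS tracks several hit/miss variants, e.g. cache_hit_revalidated and cache_miss_changed; the two below are just the common ones):

```shell
#!/bin/sh
# On the cache host: compute a rough hit ratio from ATS's counters.
hits=$(traffic_ctl metric get proxy.process.http.cache_hit_fresh 2>/dev/null | awk '{print $2}')
misses=$(traffic_ctl metric get proxy.process.http.cache_miss_cold 2>/dev/null | awk '{print $2}')
hits=${hits:-0}; misses=${misses:-0}

# A ratio near 0% means almost everything is going to the origin.
awk -v h="$hits" -v m="$misses" \
    'BEGIN { if (h + m > 0) printf "hit ratio: %.1f%%\n", 100 * h / (h + m) }'
```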
   
   Certain SSL Certificates can also cause high CPU usage, especially RSA. Are 
you using HTTPS?
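   If so, `openssl speed` on the cache host will show how expensive RSA signing is relative to ECDSA; the server performs a signature on every full TLS handshake, so this cost scales with new connections:

```shell
#!/bin/sh
# Compare sign/verify throughput; RSA-2048 signs are typically an order of
# magnitude slower than ECDSA P-256, which shows up as CPU load under
# heavy TLS handshake rates.
openssl speed -seconds 1 rsa2048 ecdsap256
```

   If RSA is the bottleneck, an ECDSA certificate and enabling TLS session reuse both cut handshake cost substantially.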
   
   In short, there are a huge number of factors that can cause bottlenecks like you're seeing. You'll have to narrow it down further and inspect your hardware usage to figure out what the bottleneck is and how to fix it. But ATS can definitely do 20Gbps+, potentially even 100Gbps, and many large corporations do so in production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

