rob05c commented on issue #7324:
URL: https://github.com/apache/trafficserver/issues/7324#issuecomment-804427594


   `loadavg` is a very broad metric: on Linux it counts runnable tasks (CPU demand) as well as tasks blocked in uninterruptible sleep (typically disk or other I/O wait), so a high value alone doesn't tell you what's saturated.
   Can you look at specific metrics on your system, and see what specifically has high load? Is it just CPU? Memory? Disk io_wait? All of the above?
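   To break that down on Linux without installing anything, the `/proc` filesystem already has what you need. A quick sketch:

```shell
#!/bin/sh
# Break a high loadavg down into its likely causes using only /proc (Linux).

# 1/5/15-minute load averages plus runnable/total task counts.
cat /proc/loadavg

# Cumulative CPU jiffies since boot; a large iowait share relative to
# user+system points at the disks rather than the CPU.
awk '/^cpu /{printf "user=%s system=%s idle=%s iowait=%s\n", $2, $4, $5, $6}' /proc/stat

# Memory pressure: low MemAvailable with shrinking SwapFree means trouble.
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```

   Note the CPU numbers are counters since boot; sample them twice a few seconds apart (or run `vmstat 1` / `iostat -x 1` from the sysstat package) to see current rates rather than totals.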
   
   ATS will use as much memory as you tell it to. You can allocate ramdisks and 
give them to ATS as block devices. Each disk given to ATS also has a memory 
cache in front of it, the size of which is configurable.
   
   See:
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#ram-cache
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/storage.config.en.html
   
https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/volume.config.en.html
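   As a rough sketch of how those pieces fit together (the device paths and sizes below are illustrative assumptions, not recommendations for your hardware):

```
# storage.config -- one cache device per line; raw block devices are preferred.
# (Assumption: the paths below are stand-ins for your real devices.)
/dev/disk/by-id/nvme-your-ssd-here
/dev/ram0

# records.config -- the RAM cache that sits in front of the disk cache,
# sized explicitly rather than left at the default.
CONFIG proxy.config.cache.ram_cache.size INT 8G
CONFIG proxy.config.cache.ram_cache_cutoff INT 4194304
```

   By default the RAM cache is sized automatically from the disk cache size; setting it explicitly is how you trade RAM for hit latency.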
   
   ATS does have some known memory leaks, but they're generally pretty small. 
It shouldn't use much more memory than what you allocated for storage and 
ram_cache, and the memory shouldn't grow much over time.
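   If you want to confirm that on your box, one simple approach is to sample the resident set size over time. A sketch (`traffic_server` is ATS's main process name):

```shell
#!/bin/sh
# Sample the resident set size (RSS) of a process a few times; a steadily
# climbing rss_kb over hours or days suggests a leak, flat is healthy.
# Usage: rss_watch <process-name> <samples> <interval-seconds>
rss_watch() {
    name="$1"; samples="${2:-5}"; interval="${3:-60}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        # pid= / rss= suppress headers; rss is reported in kilobytes.
        ps -C "$name" -o pid=,rss= | while read -r pid rss; do
            printf '%s pid=%s rss_kb=%s\n' "$(date +%FT%T)" "$pid" "$rss"
        done
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    done
}

# In practice you'd use a long interval (e.g. 300s) and let it run for a while.
rss_watch traffic_server 2 1
```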
   
   Many people run ATS in production at bandwidths much higher than 3Gbps; my company has caches doing in excess of 20Gbps. If you're having trouble reaching those speeds, another possibility is Linux kernel tuning: it's common to need a fair amount of sysctl tuning to achieve high performance, though I wouldn't expect much of it to be necessary under 10Gbps.
   
   I assume this is somewhat recent hardware, with decent CPUs? We do have some prod servers that struggle to exceed 10Gbps because of underpowered CPUs with few PCIe lanes. Platforms with too few PCIe lanes can cause exactly this kind of network bottleneck.
   
   If your high load average is being caused by disk io_wait, are you certain 
your SSDs are fast? Some SSD brands have poor performance. It may be worth 
testing their sequential and random speeds, just to be sure that isn't the 
problem.
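   A crude sequential-write sanity check needs nothing but `dd` (point the output file at a mount on the SSD under test, not `/tmp`, and treat the number as a rough lower bound):

```shell
#!/bin/sh
# Sequential write: conv=fdatasync flushes to the device before dd reports,
# so the MB/s figure reflects the disk rather than the page cache.
# Replace /tmp with a mount point on the SSD you actually want to test.
dd if=/dev/zero of=/tmp/ats-seq-test bs=1M count=256 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/ats-seq-test

# Random 4k I/O is what a cache mostly does; fio (if installed) measures it
# properly, e.g.:
#   fio --name=randrw --filename=/path/on/ssd/fio-test --rw=randrw --bs=4k \
#       --size=1G --direct=1 --iodepth=32 --runtime=60 --time_based
```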
   
   > traffic server's error log is increasing as following
   > 20201115.16h42m02s CONNECT:[0] could not connect [CONNECTION_CLOSED] to 127.0.0.1 for 'http://localhost/vod/encrypt/prod/8a01918b72167aad01722a8007db243e/8a01918b72167aad01722a8007db243e_1500_2/media-88320000.mp4'
   
   I'm not sure I understand. Your initial question was about bandwidth bottlenecks and high loadavg, but this is a connection error: it looks like an origin (on localhost?) is misconfigured, or unable to handle the request load.
   
   Are you saying you see a lot of these errors as you approach 3Gbps? That sounds like the Origin isn't able to handle the load, i.e. the problem is with the Origin, not ATS. Can you verify that your Origin itself is capable of handling the request load?
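   One low-tech way to check, assuming you can reach the origin directly from the cache host (the URL below is a placeholder, substitute a real object on your origin):

```shell
#!/bin/sh
# Hit the origin directly, bypassing ATS, and tally the status codes.
# ORIGIN_URL is a placeholder -- substitute a real object on your origin.
ORIGIN_URL="http://127.0.0.1/vod/encrypt/prod/example-object.mp4"

for i in $(seq 1 20); do
    # 000 means curl couldn't connect at all -- the same failure ATS is logging.
    curl -o /dev/null -s --max-time 5 -w '%{http_code}\n' "$ORIGIN_URL"
done | sort | uniq -c
```

   For a proper concurrent load test, tools like `ab` or `wrk` will push the origin much harder than a serial curl loop.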
   
   Are these requests mostly Cache Hits or Misses? For a CDN, ATS should presumably be caching. If all of the traffic is going to the Origin, that could be the problem: the Origin can't handle the full 3Gbps because everything is a Cache Miss, and you may need to set Cache-Control headers so ATS caches the content and the Origin only sees the misses.
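   You can answer that from ATS's own counters. A simplified sketch (ATS tracks several hit/miss variants, e.g. cache_hit_revalidated and cache_miss_changed; the two below are just the common ones):

```shell
#!/bin/sh
# On the cache host: compute a rough hit ratio from ATS's counters.
hits=$(traffic_ctl metric get proxy.process.http.cache_hit_fresh 2>/dev/null | awk '{print $2}')
misses=$(traffic_ctl metric get proxy.process.http.cache_miss_cold 2>/dev/null | awk '{print $2}')
hits=${hits:-0}; misses=${misses:-0}

# A ratio near 0% means almost everything is going to the origin.
awk -v h="$hits" -v m="$misses" \
    'BEGIN { if (h + m > 0) printf "hit ratio: %.1f%%\n", 100 * h / (h + m) }'
```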
   
   Certain SSL Certificates can also cause high CPU usage, especially RSA. Are 
you using HTTPS?
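   If so, `openssl speed` on the cache host will show how expensive RSA signing is relative to ECDSA; the server performs a signature on every full TLS handshake, so this cost scales with new connections:

```shell
#!/bin/sh
# Compare sign/verify throughput; RSA-2048 signs are typically an order of
# magnitude slower than ECDSA P-256, which shows up as CPU load under
# heavy TLS handshake rates.
openssl speed -seconds 1 rsa2048 ecdsap256
```

   If RSA is the bottleneck, an ECDSA certificate and enabling TLS session reuse both cut handshake cost substantially.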
   
   In short, there are a huge number of factors that can cause bottlenecks like you're seeing. You'll have to narrow it down further and inspect your hardware usage to figure out what the bottleneck is and how to fix it. But ATS can definitely do 20Gbps+, potentially even 100Gbps, and many large corporations do so in production.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

