c-taylor opened a new issue #7375: URL: https://github.com/apache/trafficserver/issues/7375
As parents of a topology are marked down under load, `HostStatus::getHostStatus` can cause excessive lock contention, resulting in high system time, reduced output, and gaps in stats.

During failure testing, overloading the configured parents caused lock contention on the stats storage. It was possible to consume almost all ET_NET thread time with a few failing parents and fewer than 5,000 RPS.

### Fault replication

Increase load through an edge -> parent configuration until the parents start to fail. I used connection limits as the failure trigger because they fail predictably.

### Observations

As parents fail, contention in `HostStatus::getHostStatus` increases, especially when the last parent fails. This reduces all 'good' work and causes errors to clients, including for content already in cache.

1. perf traces and flame graphs show near 100% system time spent on lock activity.
<img width="505" alt="getHostStaus_crop" src="https://user-images.githubusercontent.com/12032425/101389970-10b06400-38ba-11eb-8ccb-8bc829c63814.png">
2. traffic_server metrics stop updating.
3. Response and data rates drop.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
