zhaorongsheng opened a new issue, #63358: URL: https://github.com/apache/doris/issues/63358
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

Doris BE 2.1.x.

### What's Wrong?

After a group of BE nodes was permanently removed from the cluster (DROP BACKEND on the FE, the machines were shut down, and their DNS A/PTR records were deleted), every surviving BE in the same cluster keeps logging two kinds of WARNING forever:

Symptom A: the DNSCache refresh thread floods be.WARNING

    W<date> <ts> <tid> network_util.cpp:115] failed to get ip from host: be-old-1.example.com err: Name or service not known
    W<date> <ts> <tid> status.h:415] meet error status: [INTERNAL_ERROR]failed to get ip from host: be-old-1.example.com, err: Name or service not known
    0# doris::hostname_to_ipv4(...) at be/src/util/network_util.cpp:125
    1# doris::hostname_to_ip(...) at be/src/util/network_util.cpp:104
    2# doris::DNSCache::_update(...) at be/src/common/status.h:494
    3# doris::DNSCache::_refresh_cache() at be/src/common/status.h:380

This repeats once per minute per stale hostname, indefinitely.

Symptom B: brpc keeps reconnecting to the cached (now unreachable) IP

    W<date> <ts> <tid> socket.cpp:1270] Fail to wait EPOLLOUT of fd=<n>: Connection timed out [110]

In our case this fires ~4 times per second, ~340K times per day, accumulating > 3.7M occurrences over 11 days. The IPs the BE keeps trying to reach are the last successfully resolved IPs of the dropped hostnames, served back by DNSCache::_resolve_hostname() after every refresh failure.

A single BE's be.WARNING grew to 634 MB in 11 days, and the same growth occurs on every BE in the cluster.

Root cause

be/src/util/dns_cache.cpp (master HEAD, lines 57–121):

- _refresh_cache() iterates over every cached hostname every 60 s and calls _update.
- _update → _resolve_hostname. On resolution failure, _resolve_hostname returns the stale cached IP so callers can keep using it. That is a reasonable graceful-degradation choice.
- However, the entry is never removed from the cache map. There is no failure counter, no TTL, no eviction policy.
- Consequence: as long as the BE process lives, the hostname is re-resolved (and re-fails) once per minute, forever. BrpcClientCache / ClientCache keep handing the stale IP to brpc, which keeps timing out at the kernel level (ETIMEDOUT after tcp_syn_retries, ~127 s).

### What You Expected?

1. Bring up a Doris cluster (≥ 2 BEs).
2. Pick a hostname victim.example.com that points to a working BE. Issue queries / data ingestion that go through DNSCache::get (e.g. broker load, internal RPC) so the hostname enters the cache.
3. Decommission and remove the BE: DROP BACKEND "victim.example.com:9050";
4. Delete victim.example.com from DNS (or /etc/hosts).
5. Observe be.WARNING on the other BEs. Within 1 minute the first "failed to get ip from host" line appears. It never goes away.

### How to Reproduce?

1. Bring up a Doris cluster (≥ 2 BEs).
2. Pick a hostname victim.example.com that points to a working BE. Issue queries / data ingestion that go through DNSCache::get (e.g. broker load, internal RPC) so the hostname enters the cache.
3. Decommission and remove the BE: DROP BACKEND "victim.example.com:9050";
4. Delete victim.example.com from DNS (or /etc/hosts).
5. Observe be.WARNING on the other BEs. Within 1 minute the first "failed to get ip from host" line appears. It never goes away.

### Anything Else?

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
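For illustration only, the missing eviction policy described under "Root cause" could look roughly like the sketch below. It is not the actual DNSCache code: the class name `EvictingDnsCache`, the injected `Resolver` callback, and the `max_failures` threshold are all hypothetical, and locking is omitted. The idea is to keep serving the stale IP for a few failed refreshes (preserving the graceful degradation) and to drop the entry once a consecutive-failure limit is reached.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical standalone sketch; none of these names come from Doris.
class EvictingDnsCache {
public:
    // Resolver writes the IP to *ip and returns true on success,
    // or returns false if the hostname cannot be resolved.
    using Resolver = std::function<bool(const std::string& host, std::string* ip)>;

    EvictingDnsCache(Resolver resolver, int max_failures)
            : _resolver(std::move(resolver)), _max_failures(max_failures) {}

    // Returns the cached IP, resolving and caching it on first use.
    bool get(const std::string& host, std::string* ip) {
        auto it = _cache.find(host);
        if (it != _cache.end()) {
            *ip = it->second.ip;
            return true;
        }
        std::string resolved;
        if (!_resolver(host, &resolved)) return false;
        _cache[host] = Entry{resolved, 0};
        *ip = resolved;
        return true;
    }

    // One refresh pass over all entries. Returns how many were evicted.
    std::size_t refresh() {
        std::size_t evicted = 0;
        for (auto it = _cache.begin(); it != _cache.end();) {
            std::string resolved;
            if (_resolver(it->first, &resolved)) {
                it->second.ip = resolved;
                it->second.failures = 0; // success resets the counter
                ++it;
            } else if (++it->second.failures >= _max_failures) {
                it = _cache.erase(it); // too many consecutive failures: evict
                ++evicted;
            } else {
                ++it; // below the limit: keep serving the stale IP
            }
        }
        return evicted;
    }

    bool contains(const std::string& host) const { return _cache.count(host) > 0; }

private:
    struct Entry {
        std::string ip;
        int failures;
    };

    Resolver _resolver;
    int _max_failures;
    std::unordered_map<std::string, Entry> _cache;
};
```

With a threshold of 3 and a 60 s refresh interval, a dropped hostname would stop being re-resolved (and stop producing warnings from the refresh thread) after about three minutes, while a transient DNS outage shorter than that would still be bridged by the stale IP.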
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
