wklken opened a new issue, #12436: URL: https://github.com/apache/apisix/issues/12436
### Current Behavior

Under some conditions, when the IP of a domain changes, APISIX keeps using the old IP, which causes 504 Gateway Timeout errors. It never recovers until `apisix reload` is run. Meanwhile, the `dig` and `nslookup` commands return the newest IP.

### Expected Behavior

APISIX should detect that the IP has changed.

### Error Logs

```
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:213: parse_domain_in_route(): parse_domain_in_route | new_nodes=[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:219: parse_domain_in_route(): parse_domain_in_route | up_conf:{"timeout":{"send":30,"connect":30,"read":30},"hash_on":"vars","type":"roundrobin","parent":{"update_count":0,"modifiedIndex":5360,"orig_modifiedIndex":5360,"clean_handlers":{},"createdIndex":5360,"has_domain":true,"key":"/bk-gateway-apisix/routes/apigw.prod.2347","value":{"timeout":{"send":30,"connect":30,"read":30},"desc":"Returns 
anything passed in request data.","name":"apigw-prod-anything-get","labels":{"gateway.bk.tencent.com/stage":"prod","gateway.bk.tencent.com/gateway":"apigw"},"update_time":1752566944,"plugins":{"bk-proxy-rewrite":{"match_subpath":false,"uri":"/anything","subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false},"bk-resource-context":{"bk_resource_name":"anything_get","bk_resource_id":2347,"bk_resource_auth":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false},"bk_resource_auth_obj":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false}}},"uris":["/api/apigw/prod/anything","/api/apigw/prod/anything/"],"upstream":{"timeout":"table: 0x7f119b810dd0","hash_on":"vars","type":"roundrobin","parent":"table: 0x7f1199322a98","original_nodes":[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],"nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"},"status":1,"id":"apigw.prod.2347","service_id":"apigw.prod.stage-4","priority":0,"methods":["GET"],"create_time":1752566944}},"original_nodes":"table: 0x7f11693587e0","nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"}, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:221: parse_domain_in_route(): parse_domain_in_route | compare result:true, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:223: parse_domain_in_route(): parse_domain_in_route | no change, use old route, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
```

### Steps to Reproduce

1. Add a route whose `route.upstream.nodes` has `nodes[0].host = httpbin`, where `httpbin` is a Kubernetes Service, so the route proxies to the httpbin service.

```
$ curl -H "X-API-KEY: $admin_key" http://127.0.0.1:9180/apisix/admin/routes/apigw.prod.2347 | jq
{
  "key": "/bk-gateway-apisix/routes/apigw.prod.2347",
  "modifiedIndex": 5360,
  "createdIndex": 5360,
  "value": {
    "timeout": {
      "send": 30,
      "connect": 30,
      "read": 30
    },
    "desc": "Returns anything passed in request data.",
    "name": "apigw-prod-anything-get",
    "update_time": 1752566944,
    "plugins": {
      "proxy-rewrite": {
        "method": "GET",
        "uri": "/anything"
      }
    },
    "create_time": 1752566944,
    "upstream": {
      "timeout": {
        "send": 30,
        "connect": 30,
        "read": 30
      },
      "nodes": [
        {
          "weight": 100,
          "priority": 1,
          "port": 80,
          "host": "httpbin"
        }
      ],
      "pass_host": "node",
      "scheme": "http",
      "type": "roundrobin"
    },
    "labels": {
      "gateway.bk.tencent.com/stage": "prod",
      "gateway.bk.tencent.com/gateway": "apigw"
    },
    "id": "apigw.prod.2347",
    "service_id": "apigw.prod.stage-4",
    "status": 1,
    "methods": [
      "GET"
    ],
    "uris": [
      "/api/apigw/prod/anything",
      "/api/apigw/prod/anything/"
    ]
  }
}
```

Here, `route.upstream.nodes[0].host = httpbin`.

2.
Add `core.log.error` calls for debugging to apisix/init.lua:

```lua
local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_in_route | new_nodes=", core.json.delay_encode(new_nodes, true))
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    core.log.error("parse_domain_in_route | up_conf:", core.json.delay_encode(up_conf, true))
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    core.log.error("parse_domain_in_route | compare result:", ok)
    if ok then
        core.log.error("parse_domain_in_route | no change, use old route")
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

    -- Here we copy the whole route instead of part of it,
    -- so that we can avoid going back from route.value to route during copying.
    route.dns_value = core.table.deepcopy(route).value
    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ", core.json.delay_encode(route, true))
    return route
end
```

and to apisix/utils/upstream.lua:

```lua
local function parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_for_nodes: ", core.json.delay_encode(nodes, true))
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        core.log.error("parse_domain_for_nodes: host=", host)
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.log.error("parse_domain_for_nodes: add the node back")
            core.table.insert(new_nodes, node)
        end
    end
    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes
```

5.
Run `apisix reload` and update routes in etcd to trigger `config_etcd.lua:389: sync_data()`.
6. At the same time, delete the httpbin service and `kubectl apply` it again (the cluster IP will change). (Not 100% reproducible.)
7. curl the route.

-----

According to the error.log:

1. The first argument of `parse_domain_for_nodes` is `[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}]`; the host is already an IP here:

   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"

2. Because the host is not a domain, `core.resolver.parse_domain(host)` is not called:

   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"

3. The node is then added back unchanged:

   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"

------

So the worker never detects the IP change until `apisix reload`.

### Environment

- APISIX version (run `apisix version`): 3.2.1
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):
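The failure mode described in the log analysis can be illustrated outside APISIX with a small standalone Lua sketch. It is not APISIX code: the `dns` table stands in for the resolver, `is_ip` is a simplified IPv4-only replacement for the `ipmatcher` checks, and the second IP address is made up. It only mirrors the control flow of `parse_domain_for_nodes` quoted in this issue, to show why feeding already-resolved nodes back into the function can never pick up a new IP.

```lua
-- Stand-in for DNS: maps a domain to its current IP (hypothetical data).
local dns = { httpbin = "10.105.226.135" }

-- Simplified IPv4 check; the code in the issue uses ipmatcher.parse_ipv4/parse_ipv6.
local function is_ip(host)
    return host:match("^%d+%.%d+%.%d+%.%d+$") ~= nil
end

-- Mirrors the control flow of parse_domain_for_nodes from the issue:
-- domains are resolved, hosts that already look like an IP pass through.
local function parse_domain_for_nodes(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        if not is_ip(node.host) then
            local ip = dns[node.host]
            if ip then
                new_nodes[#new_nodes + 1] = {
                    host = ip, domain = node.host, port = node.port,
                }
            end
        else
            new_nodes[#new_nodes + 1] = node
        end
    end
    return new_nodes
end

-- First pass: the route still holds the domain, so it gets resolved.
local original = { { host = "httpbin", port = 80 } }
local resolved = parse_domain_for_nodes(original)

-- The Service is recreated and receives a new cluster IP (made-up address).
dns.httpbin = "10.105.99.1"

-- Feeding the already-resolved nodes back in skips resolution entirely.
local stale = parse_domain_for_nodes(resolved)
-- Resolving from the original, domain-based nodes picks up the change.
local fresh = parse_domain_for_nodes(original)

print(stale[1].host)  -- 10.105.226.135
print(fresh[1].host)  -- 10.105.99.1
```

In this sketch the stale result comes purely from the input already containing an IP in `host`, which matches what the `upstream.lua:65` log line shows for the real worker.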