wklken opened a new issue, #12436:
URL: https://github.com/apache/apisix/issues/12436

   ### Current Behavior
   
   Under some conditions, when the IP of an upstream domain changes, APISIX keeps using the old IP, which causes 504 Gateway Timeout errors.
   
   It never recovers until `apisix reload` is run.
   
   Meanwhile, the `dig` and `nslookup` commands return the new IP.
   
   
   ### Expected Behavior
   
   APISIX should detect that the IP has changed.
   
   ### Error Logs
   
   ```
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:213: parse_domain_in_route(): parse_domain_in_route | new_nodes=[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:219: parse_domain_in_route(): parse_domain_in_route | up_conf:{"timeout":{"send":30,"connect":30,"read":30},"hash_on":"vars","type":"roundrobin","parent":{"update_count":0,"modifiedIndex":5360,"orig_modifiedIndex":5360,"clean_handlers":{},"createdIndex":5360,"has_domain":true,"key":"/bk-gateway-apisix/routes/apigw.prod.2347","value":{"timeout":{"send":30,"connect":30,"read":30},"desc":"Returns anything passed in request data.","name":"apigw-prod-anything-get","labels":{"gateway.bk.tencent.com/stage":"prod","gateway.bk.tencent.com/gateway":"apigw"},"update_time":1752566944,"plugins":{"bk-proxy-rewrite":{"match_subpath":false,"uri":"/anything","subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false},"bk-resource-context":{"bk_resource_name":"anything_get","bk_resource_id":2347,"bk_resource_auth":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false},"bk_resource_auth_obj":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false}}},"uris":["/api/apigw/prod/anything","/api/apigw/prod/anything/"],"upstream":{"timeout":"table: 0x7f119b810dd0","hash_on":"vars","type":"roundrobin","parent":"table: 0x7f1199322a98","original_nodes":[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],"nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"},"status":1,"id":"apigw.prod.2347","service_id":"apigw.prod.stage-4","priority":0,"methods":["GET"],"create_time":1752566944}},"original_nodes":"table: 0x7f11693587e0","nodes":"table: 0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table: 0x7f11693587e0"}, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:221: parse_domain_in_route(): parse_domain_in_route | compare result:true, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:223: parse_domain_in_route(): parse_domain_in_route | no change, use old route, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   ```
   
   ### Steps to Reproduce
   
   1. Add a route whose `route.upstream.nodes[0].host` is `httpbin`, a Service (svc) in k8s, so that the route proxies to the httpbin service.
   
   ```
   $ curl -H "X-API-KEY: $admin_key" http://127.0.0.1:9180/apisix/admin/routes/apigw.prod.2347 | jq
   {
     "key": "/bk-gateway-apisix/routes/apigw.prod.2347",
     "modifiedIndex": 5360,
     "createdIndex": 5360,
     "value": {
       "timeout": {
         "send": 30,
         "connect": 30,
         "read": 30
       },
       "desc": "Returns anything passed in request data.",
       "name": "apigw-prod-anything-get",
       "update_time": 1752566944,
       "plugins": {
         "proxy-rewrite": {
           "method": "GET",
           "uri": "/anything"
         }
       },
       "create_time": 1752566944,
       "upstream": {
         "timeout": {
           "send": 30,
           "connect": 30,
           "read": 30
         },
         "nodes": [
           {
             "weight": 100,
             "priority": 1,
             "port": 80,
             "host": "httpbin"
           }
         ],
         "pass_host": "node",
         "scheme": "http",
         "type": "roundrobin"
       },
       "labels": {
         "gateway.bk.tencent.com/stage": "prod",
         "gateway.bk.tencent.com/gateway": "apigw"
       },
       "id": "apigw.prod.2347",
       "service_id": "apigw.prod.stage-4",
       "status": 1,
       "methods": [
         "GET"
       ],
       "uris": [
         "/api/apigw/prod/anything",
         "/api/apigw/prod/anything/"
       ]
     }
   }
   ```
   
   Here, `route.upstream.nodes[0].host` is `httpbin`.
   
   2. Add `core.log.error` calls for debugging.
   
   apisix/init.lua
   
   ```lua
   local function parse_domain_in_route(route)
       local nodes = route.value.upstream.nodes
       local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
       core.log.error("parse_domain_in_route | new_nodes=", core.json.delay_encode(new_nodes, true))
       if not new_nodes then
           return nil, err
       end
   
       local up_conf = route.dns_value and route.dns_value.upstream
       core.log.error("parse_domain_in_route | up_conf:", core.json.delay_encode(up_conf, true))
       local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
       core.log.error("parse_domain_in_route | compare result:", ok)
       if ok then
           core.log.error("parse_domain_in_route | no change, use old route")
           return route
       end
   
       -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
       -- has changed
   
       -- Here we copy the whole route instead of part of it,
       -- so that we can avoid going back from route.value to route during copying.
       route.dns_value = core.table.deepcopy(route).value
       route.dns_value.upstream.nodes = new_nodes
       core.log.info("parse route which contain domain: ",
                     core.json.delay_encode(route, true))
       return route
   end
   ```
   
   and
   
   apisix/utils/upstream.lua 
   
   ```lua
   local function parse_domain_for_nodes(nodes)
       core.log.error("parse_domain_for_nodes: ", core.json.delay_encode(nodes, true))
       local new_nodes = core.table.new(#nodes, 0)
       for _, node in ipairs(nodes) do
           local host = node.host
           core.log.error("parse_domain_for_nodes: host=", host)
           if not ipmatcher.parse_ipv4(host) and
                   not ipmatcher.parse_ipv6(host) then
               local ip, err = core.resolver.parse_domain(host)
               if ip then
                   local new_node = core.table.clone(node)
                   new_node.host = ip
                   new_node.domain = host
                   core.table.insert(new_nodes, new_node)
               end
   
               if err then
                   core.log.error("dns resolver domain: ", host, " error: ", err)
               end
           else
               core.log.error("parse_domain_for_nodes: add the node back")
               core.table.insert(new_nodes, node)
           end
       end
   
       return new_nodes
   end
   _M.parse_domain_for_nodes = parse_domain_for_nodes
   ```
   
   
   5. Run `apisix reload` and update routes in etcd, which triggers `config_etcd.lua:389: sync_data()`.
   6. At the same time, delete the httpbin service and `kubectl apply` it again (the cluster IP changes). [Not 100% reproducible]
   7. Send a request to the route with curl.
   
   -----
   
   According to the error.log:
   
   1. The first argument to `parse_domain_for_nodes` is `[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}]`, so the host is already an IP here.
   
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65: parse_domain_for_nodes(): parse_domain_for_nodes: [{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}], client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   
   2. Because the host is an IP rather than a domain, `core.resolver.parse_domain(host)` is never called.
   
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69: parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
   
   3. The node is then added back unchanged.
   
   2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84: parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
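   
   The three steps above can be modeled in plain Lua without APISIX. This is only a sketch: `is_ipv4`, `resolve` and the `dns` table are hypothetical stand-ins for `ipmatcher.parse_ipv4`, `core.resolver.parse_domain` and real DNS (IPv6 is ignored), and the new IP `10.105.0.99` is made up. The control flow mirrors the `parse_domain_for_nodes` shown earlier, minus the logging.
   
   ```lua
   -- Stand-in for ipmatcher.parse_ipv4: true when host looks like a dotted IPv4 address.
   local function is_ipv4(host)
       return host:match("^%d+%.%d+%.%d+%.%d+$") ~= nil
   end
   
   -- Hypothetical DNS table; the entry changes when the Service is recreated.
   local dns = { httpbin = "10.105.226.135" }
   local resolve_calls = 0
   
   -- Stand-in for core.resolver.parse_domain; counts how often it is called.
   local function resolve(host)
       resolve_calls = resolve_calls + 1
       return dns[host]
   end
   
   -- Same branching as parse_domain_for_nodes in the issue, without logging.
   local function parse_domain_for_nodes(nodes)
       local new_nodes = {}
       for _, node in ipairs(nodes) do
           if not is_ipv4(node.host) then
               local ip = resolve(node.host)
               if ip then
                   local new_node = {}
                   for k, v in pairs(node) do new_node[k] = v end
                   new_node.host = ip
                   new_node.domain = node.host
                   table.insert(new_nodes, new_node)
               end
           else
               -- host is already an IP: the node is added back untouched
               table.insert(new_nodes, node)
           end
       end
       return new_nodes
   end
   
   -- First pass: host is still the domain name, so it is resolved once.
   local first = parse_domain_for_nodes({ { host = "httpbin", port = 80, weight = 100 } })
   
   -- The Service is recreated and gets a new (made-up) cluster IP.
   dns.httpbin = "10.105.0.99"
   
   -- Second pass receives the already-resolved nodes, as in the log above.
   local second = parse_domain_for_nodes(first)
   print(second[1].host, resolve_calls)
   ```
   
   In this model the second pass keeps `10.105.226.135` and the resolver is called only once, which matches what the log shows.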
   
   ------
   
   So the worker never detects the IP change until `apisix reload` is run.
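   
   One possible direction, offered only as a sketch and not as a confirmed fix: when a node already carries a `domain` field, resolve that name again instead of trusting the stored IP. The helpers below (`is_ipv4`, `resolve`, the `dns` table, the IP `10.105.0.99`) are hypothetical stand-ins so the snippet runs outside APISIX. Whether the real fix belongs here, or in whatever passes already-resolved nodes into this function, is an open question.
   
   ```lua
   -- Stand-in for ipmatcher.parse_ipv4 (IPv6 is ignored in this sketch).
   local function is_ipv4(host)
       return host:match("^%d+%.%d+%.%d+%.%d+$") ~= nil
   end
   
   -- Hypothetical DNS table standing in for core.resolver.parse_domain.
   local dns = { httpbin = "10.105.226.135" }
   local function resolve(host)
       return dns[host]
   end
   
   -- Variant that prefers the recorded domain over the stored host.
   local function parse_domain_for_nodes(nodes)
       local new_nodes = {}
       for _, node in ipairs(nodes) do
           -- If an earlier pass saved the original name, resolve that name again.
           local name = node.domain or node.host
           if is_ipv4(name) then
               table.insert(new_nodes, node)
           else
               local ip = resolve(name)
               if ip then
                   local new_node = {}
                   for k, v in pairs(node) do new_node[k] = v end
                   new_node.host = ip
                   new_node.domain = name
                   table.insert(new_nodes, new_node)
               end
           end
       end
       return new_nodes
   end
   
   local first = parse_domain_for_nodes({ { host = "httpbin", port = 80, weight = 100 } })
   dns.httpbin = "10.105.0.99"  -- made-up new cluster IP after the Service is recreated
   local second = parse_domain_for_nodes(first)
   print(first[1].host, second[1].host)
   ```
   
   With this variant the second pass picks up the new address, because the lookup is keyed on `domain` rather than on the previously resolved `host`.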
   
   
   
   
   
   
   ### Environment
   
   - APISIX version (run `apisix version`): 3.2.1
   - Operating system (run `uname -a`):
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
   - etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
   - APISIX Dashboard version, if relevant:
   - Plugin runner version, for issues related to plugin runners:
   - LuaRocks version, for installation issues (run `luarocks --version`):
   

