Copilot commented on code in PR #12867:
URL: https://github.com/apache/apisix/pull/12867#discussion_r2726744450


##########
apisix/discovery/nacos/init.lua:
##########
@@ -350,10 +356,13 @@ local function fetch_from_host(base_uri, username, 
password, services)
     for key, nodes in pairs(nodes_cache) do
         local content = core.json.encode(nodes)
         nacos_dict:set(key, content)
+        local nodes_version = ngx.crc32_long(content)

Review Comment:
   `nodes_version` is derived from `core.json.encode(nodes)` output, but 
`cjson.encode` does not guarantee stable object key ordering (see 
apisix/core/json.lua:38-41). This can cause the CRC to change even when the 
nodes data is logically unchanged, defeating the version-controlled LRU cache 
and re-triggering JSON decoding. Consider generating the version from a stable 
representation (e.g., `core.json.stably_encode` and/or sorting `nodes` before 
encoding) so the version only changes when the data changes.
   ```suggestion
           local stable_content = core.json.stably_encode(nodes)
           local nodes_version = ngx.crc32_long(stable_content)
   ```



##########
apisix/discovery/nacos/init.lua:
##########
@@ -32,13 +32,19 @@ local string_sub         = string.sub
 local str_byte           = string.byte
 local str_find           = core.string.find
 local log                = core.log
+local process            = require("ngx.process")
 
 local default_weight
 local nacos_dict = ngx.shared.nacos --key: namespace_id.group_name.service_name
 if not nacos_dict then
     error("lua_shared_dict \"nacos\" not configured")
 end
 
+local nodes_lrucache = core.lrucache.new({
+    ttl = 300,  -- 5分钟TTL
+    count = 1024

Review Comment:
   The inline comment for the LRU cache TTL is in Chinese ("5分钟TTL"), while the 
surrounding codebase uses English comments. Please change it to an English 
comment (or remove it) to match repository conventions (e.g., 
apisix/discovery/kubernetes/init.lua:39-42).



##########
apisix/discovery/nacos/init.lua:
##########
@@ -419,8 +439,12 @@ function _M.init_worker()
     log.info('fetch_interval:', fetch_interval)
     access_key = local_conf.discovery.nacos.access_key
     secret_key = local_conf.discovery.nacos.secret_key
-    ngx_timer_at(0, fetch_full_registry)
-    ngx_timer_every(fetch_interval, fetch_full_registry)
+
+    -- Only privileged agent should fetch nacos registry
+    if process.type() == "privileged agent" then
+        ngx_timer_at(0, fetch_full_registry)
+        ngx_timer_every(fetch_interval, fetch_full_registry)
+    end

Review Comment:
   The behavioral change to fetch the registry only in the privileged agent and 
the new version-controlled LRU caching path are not covered by existing nacos 
discovery tests. Please add/extend tests (see t/discovery/nacos*.t) to assert 
that only one process performs nacos fetches and that workers can still resolve 
nodes across reloads / when cache versioning is in effect.



##########
apisix/discovery/nacos/init.lua:
##########
@@ -350,10 +356,13 @@ local function fetch_from_host(base_uri, username, 
password, services)
     for key, nodes in pairs(nodes_cache) do
         local content = core.json.encode(nodes)
         nacos_dict:set(key, content)
+        local nodes_version = ngx.crc32_long(content)
+        nacos_dict:set(key .. "#version", nodes_version)

Review Comment:
   Writes to the shared dict ignore return values, which can leave `key` and 
`key#version` inconsistent under shm pressure (e.g., content set fails but 
version set succeeds, or vice versa). This can cause workers to treat a service 
as missing or read mismatched data. Please follow the established pattern used 
in other discovery modules (e.g., apisix/discovery/tars/init.lua:78-93): use 
`safe_set` (or check `set` return values), and on failure, avoid publishing the 
version without content (and/or roll back by deleting the companion key).
   ```suggestion
           local ok, err = nacos_dict:safe_set(key, content)
           if not ok then
               log.error("failed to set nacos discovery content for key: ", key,
                         ", error: ", err)
               -- avoid leaving an inconsistent state
               nacos_dict:delete(key)
               nacos_dict:delete(key .. "#version")
           else
               local nodes_version = ngx.crc32_long(content)
               local ok_ver, err_ver = nacos_dict:safe_set(key .. "#version", 
nodes_version)
               if not ok_ver then
                   log.error("failed to set nacos discovery version for key: ", 
key,
                             ", error: ", err_ver)
                   -- rollback content to keep key and key#version consistent
                   nacos_dict:delete(key)
                   nacos_dict:delete(key .. "#version")
               end
           end
   ```



##########
apisix/discovery/nacos/init.lua:
##########
@@ -396,19 +405,30 @@ local function fetch_full_registry(premature)
 end
 
 
+local function load_nodes_from_dict(key)
+    local value = nacos_dict:get(key)
+    if not value then
+        core.log.error("nacos service not found: ", key)
+        return nil
+    end
+    local nodes = core.json.decode(value)
+    return nodes
+end
+
 function _M.nodes(service_name, discovery_args)
     local namespace_id = discovery_args and
             discovery_args.namespace_id or default_namespace_id
     local group_name = discovery_args
             and discovery_args.group_name or default_group_name
     local key = get_key(namespace_id, group_name, service_name)
-    local value = nacos_dict:get(key)
-    if not value then
-        core.log.error("nacos service not found: ", service_name)
+
+    local nodes_version = nacos_dict:get(key .. "#version")
+    if not nodes_version then
+        core.log.error("nacos service version not found: ", key)
         return nil

Review Comment:
   `_M.nodes` now hard-requires `key#version` and returns nil if it's missing. 
On hot reload/upgrade, the shared dict can still contain the previous `key` 
value without a corresponding `#version` entry, causing transient 503s until 
the privileged agent refreshes. It can also happen if the version write fails. 
To keep behavior backward compatible and more robust, consider falling back to 
reading/decoding `key` when `#version` is absent (and optionally backfilling 
the version).
   ```suggestion
           -- Fallback for legacy / transient states where only `key` exists
           core.log.warn("nacos service version not found, fallback to legacy 
key: ", key)
   
           local nodes = load_nodes_from_dict(key)
           if not nodes then
               -- No data under legacy key either; keep existing behavior
               return nil
           end
   
           -- Backfill a default version so subsequent lookups are consistent
           nodes_version = 0
           local ok, err = nacos_dict:safe_set(key .. "#version", nodes_version)
           if not ok then
               core.log.warn("failed to backfill nacos service version for key: 
",
                             key, ", err: ", err)
           end
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to